Abstract

Estimation of distribution algorithms (EDAs) guide the search for the optimum by building 
and sampling explicit probabilistic models of promising candidate solutions. However, EDAs 
are not only optimization techniques; besides the optimum or its approximation, EDAs provide 
practitioners with a series of probabilistic models that reveal a lot of information about the 
problem being solved. This information can in turn be used to design problem-specific neigh- 
borhood operators for local search, to bias future runs of EDAs on a similar problem, or to 
create an efficient computational model of the problem. This chapter provides an introduction 
to EDAs as well as a number of pointers for obtaining more information about this class of 
algorithms. 

1 Introduction 

Estimation of distribution algorithms (EDAs) |7| E3 ESI ISl EH DM UM E3D], also called proba- 
bilistic model-building genetic algorithms and iterated density estimation evolutionary algorithms, 
view optimization as a series of incremental updates of a probabilistic model, starting with the 
model encoding the uniform distribution over admissible solutions and ending with the model that
generates only the global optima. In the past decade and a half, EDAs have been applied to many
challenging optimization problems [HI El HH EH ESI EU E21 ESl EH 1153 ESS EH Q22]. In many 
of these studies, EDAs were shown to solve problems that were intractable with other techniques
or for which no other technique could achieve comparable results. However, the motive for the use of EDAs
in practice is not only that these algorithms can solve difficult optimization problems, but that in 
addition to the optimum or its approximation EDAs provide practitioners with a compact compu- 
tational model of the problem represented by a series of probabilistic models |117} EU E7]. These 
probabilistic models reveal a lot of information about the problem domain, which can in turn be 
used to bias optimization of similar problems, to create problem-specific neighborhood operators, and
to support many other tasks. While many metaheuristics exist that essentially sample implicit probability
distributions by using a combination of stochastic search operators, the insight into the problem 
represented by the series of explicit probabilistic models of promising candidate solutions gives 
EDAs a clear edge over most other metaheuristics. 

This chapter provides an introduction to EDAs. Additionally, the chapter presents numerous 
pointers for obtaining additional information about this class of algorithms. 

The chapter is organized as follows. Section 2 outlines the basic procedure of an EDA. Section 3
presents a taxonomy of EDAs based on the type of decomposition encoded by the model and the
type of local distributions used in the model. Section 4 reviews some of the most popular EDAs.
Section 5 discusses major research directions and past results in theoretical modeling of EDAs.
Section 6 focuses on efficiency enhancement techniques for EDAs. Section 7 gives pointers for
obtaining additional information about EDAs. Section 8 summarizes and concludes the chapter.

2 Basic EDA Procedure 

2.1 Problem Definition 

An optimization problem may be defined by specifying (1) a set of potential solutions to the problem 
and (2) a procedure for evaluating the quality of these solutions. The set of potential solutions is 
often defined using a general representation of admissible solutions and a set of constraints. The 
procedure for evaluating the quality of candidate solutions can either be defined as a function that 
is to be minimized or maximized (often referred to as an objective function or fitness function) or 
as a partial ordering operator. The task is to find a solution from the set of potential solutions that 
maximizes quality as defined by the evaluation procedure. 

As an example, let us consider the quadratic assignment problem (QAP), which is one of the 
fundamental NP-hard combinatorial problems [86]. In QAP the input consists of distances between
n locations and flows between n facilities. The task is to find a one-to-one assignment of facilities 
to locations so that the overall cost is minimized. The cost for a pair of locations is defined as the 
product of the distance between these locations and the flow between the facilities assigned to these 
locations; the overall cost is the sum of the individual costs for all pairs of locations. Therefore, in 
QAP, potential solutions are defined as permutations that define assignments of facilities to loca- 
tions and the solution quality is evaluated using the cost function discussed above. The task is to 
minimize the cost. As another example, consider the maximum satisfiability problem for propositional
logic formulas defined in conjunctive normal form with 3 literals per clause (MAX3SAT). In
MAX3SAT, each potential solution defines one interpretation of propositions (making each proposition
either true or false), and the quality of a solution is measured by the number of clauses that
are satisfied by the specific interpretation. The task is to find an interpretation that maximizes the 
number of satisfied clauses. 
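
The two evaluation procedures described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the encoding of an assignment as a permutation, the signed-integer encoding of literals, and the tiny instances are all hypothetical, and the QAP cost is summed over ordered pairs of distinct locations (for symmetric matrices this merely doubles the cost of every assignment).

```python
from itertools import permutations


def qap_cost(assignment, distance, flow):
    """Total QAP cost; assignment[k] is the facility placed at location k."""
    n = len(assignment)
    return sum(
        distance[a][b] * flow[assignment[a]][assignment[b]]
        for a in range(n)
        for b in range(n)
        if a != b  # ordered pairs of distinct locations (an assumption)
    )


def max3sat_satisfied(interpretation, clauses):
    """Number of clauses satisfied by a truth assignment.

    A literal k > 0 requires proposition k to be True, and k < 0 requires
    proposition |k| to be False; interpretation[j] is the value of
    proposition j + 1.
    """
    return sum(
        any(interpretation[abs(lit) - 1] == (lit > 0) for lit in clause)
        for clause in clauses
    )


# Hypothetical 3-location QAP instance, small enough for brute force.
distance = [[0, 2, 3], [2, 0, 1], [3, 1, 0]]
flow = [[0, 5, 1], [5, 0, 2], [1, 2, 0]]
best_assignment = min(
    permutations(range(3)), key=lambda a: qap_cost(a, distance, flow)
)

# Hypothetical MAX3SAT instance with three clauses over three propositions.
clauses = [(1, 2, 3), (-1, -2, 3), (1, -2, -3)]
n_satisfied = max3sat_satisfied([True, True, False], clauses)
```

Brute-force enumeration is used here only because the instance has three locations; the number of permutations grows factorially with n, which is precisely why heuristic search is needed for realistic instances.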



Without additional assumptions about the problem, one way to find the optimum is to repeat 
three main steps: 

1. Generate candidate solutions. 

2. Evaluate the generated solutions. 

3. Update the procedure for generating new candidate solutions according to the results of the 
evaluation. 

Ideally, the quality of generated solutions would improve over time and after a reasonable 
number of iterations, the execution of these three steps would generate the global optimum or its 
accurate approximation. Different algorithms implement the above three steps in different ways, 
but the key idea remains the same — iteratively update the procedure for generating candidate 
solutions so that generated candidate solutions continually improve in quality. 

2.2 EDA Procedure 

In estimation of distribution algorithms (EDAs) the central idea is to maintain an explicit proba- 
bilistic model to represent the distribution over candidate solutions, and to adjust the model based 
on the results of the evaluation of these solutions so that it will generate better candidate solutions 
in the future. Note that using an explicit probabilistic model makes EDAs quite different from many
other metaheuristics, such as genetic algorithms [48, 74] or simulated annealing [29, 78], in which
the probability distribution used to generate new candidate solutions is often defined implicitly by 
a search operator or a combination of several search operators. Researchers often distinguish two 
main types of EDAs: 

Population-based EDAs. Population-based EDAs maintain a population (multiset) of candi- 
date solutions, starting with a population generated at random according to the uniform 
distribution over all admissible solutions. Each iteration starts by creating a population of 
promising candidate solutions using the selection operator, which gives preference to solutions 
of higher quality. Any popular selection method for evolutionary algorithms can be used, such 
as truncation or tournament selection [26, 44]. Truncation selection, for example, selects
the top r% members of the population. A probabilistic model is then built for
the selected solutions. New solutions are created by sampling the distribution encoded by 
the built model. The new solutions are then incorporated into the original population using 
a replacement operator. In full replacement, for example, the entire original population of 
candidate solutions is replaced by the new ones. Pseudocode of a population-based EDA
is shown in Figure 1.

Incremental EDAs. In incremental EDAs, the population of candidate solutions is fully replaced 
by a probabilistic model. The model is initialized so that it encodes the uniform distribution 
over all admissible solutions. The model is then updated incrementally by repeating the 
process of (1) sampling several candidate solutions from the current model and (2) improving 
the model based on the evaluation of these candidate solutions and their comparison.
Pseudocode of an incremental EDA is shown in Figure 2.

Incremental EDAs often generate only a few candidate solutions at a time, whereas population- 
based EDAs often work with a large population of candidate solutions, building each model from 
scratch. Nonetheless, the two approaches are essentially the same because
even population-based EDAs can be reformulated in an incremental manner.



1. t <- 0
2. generate population P(0) of random solutions
3. while termination criteria not satisfied, repeat
4.     evaluate all candidate solutions in P(t)
5.     select promising solutions S(t) from P(t)
6.     build a probabilistic model M(t) for S(t)
7.     generate new solutions O(t) by sampling M(t)
8.     create P(t + 1) by combining O(t) and P(t)
9.     t <- t + 1

Figure 1: Population-based estimation of distribution algorithm. 

1. t <- 0
2. initialize model M(0) to represent the uniform distribution over admissible solutions
3. while termination criteria not satisfied, repeat
4.     generate population P(t) of candidate solutions by sampling M(t)
5.     evaluate all candidate solutions in P(t)
6.     create new model M(t + 1) by adjusting M(t) according to evaluated P(t)
7.     t <- t + 1

Figure 2: Incremental estimation of distribution algorithm. 

The main components of an EDA thus include (1) a selection operator for selecting promising 
solutions, (2) an assumed class of probabilistic models to use for modeling and sampling, (3) a 
procedure for learning a probabilistic model for the selected solutions, (4) a procedure for sampling 
the built probabilistic model, and (5) a replacement operator for combining the populations of old 
and new candidate solutions. The procedure for learning a probabilistic model usually requires 
two subcomponents: a metric for evaluating the probabilistic models from the assumed class, and 
a search procedure for choosing a particular model based on the metric used. EDAs differ mainly
in the class of probabilistic models and the procedures used for evaluating candidate models and 
searching for a good model. 
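
Components (1) and (5) are independent of the probabilistic model, so they can be illustrated on their own. The following Python sketch shows truncation selection, tournament selection, and full replacement. The function names, the parameter names, and the four-member example population are assumptions made for illustration; higher objective values are assumed to be better.

```python
import random


def truncation_selection(population, fitness, percent):
    """Return the best `percent`% of the population (at least one member)."""
    ranked = sorted(population, key=fitness, reverse=True)
    keep = max(1, int(len(ranked) * percent / 100))
    return ranked[:keep]


def tournament_selection(population, fitness, n_select, size=2, rng=random):
    """Run n_select tournaments; each keeps the best of `size` random members."""
    return [
        max(rng.sample(population, size), key=fitness) for _ in range(n_select)
    ]


def full_replacement(old_population, new_solutions):
    """Discard the old population and keep only the new solutions."""
    return list(new_solutions)


def onemax(bits):
    """Example objective: the number of 1s in a binary string."""
    return sum(bits)


population = [(1, 0, 1, 1, 0), (0, 0, 0, 1, 0), (1, 1, 1, 1, 0), (0, 1, 0, 0, 0)]
selected = truncation_selection(population, onemax, percent=50)
winners = tournament_selection(population, onemax, n_select=3, rng=random.Random(0))
```

Passing the random number generator explicitly keeps the stochastic operator reproducible, which is convenient when comparing EDA variants that differ only in components (2) through (4).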

The general outline of an EDA is quite similar to that of a traditional evolutionary algorithm 
(EA) [38]; both guide the search toward promising solutions by iteratively performing selection and
variation, the two key ingredients of any EA. In particular, components (1) and (5) are precisely 
the same as those used in other EAs. Components (2), (3), and (4), however, are unique to EDAs, 
and constitute their way of producing variation, as opposed to using recombination and mutation 
operators as is often done with other EAs. 



As we shall see, this alternative perspective opens a way for designing search procedures from 
principled grounds by bringing to the evolutionary computation domain a vast body of knowledge 
from the machine learning literature, and in particular from probabilistic graphical models. The 
key idea of EDAs is to look at a population of previously visited good solutions as data, learn a 
model (or theory) of that data, and use the resulting model to infer where other good solutions 
might be. This approach is powerful, allowing a search algorithm to learn and adapt itself with 
respect to the optimization problem being solved, while it is being solved. 

2.3 Simulation of an EDA by Hand 

To better understand the EDA procedure, this section presents a simple EDA simulation by hand. 
The purpose of presenting the simulation is to clarify the components of the basic EDA procedure 
and to build intuition about the dynamics of an EDA run. 

The simulation assumes that candidate solutions are represented by binary strings of fixed 
length $n > 0$. The objective function to maximize is onemax, which is defined as the sum of the
bits in the input binary string $(X_1, X_2, \ldots, X_n)$:

\[
f_{\mathrm{onemax}}(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} X_i. \tag{1}
\]

The quality of a candidate solution improves with the number of 1s in the input string, and the
optimum is the string of all 1s.

To model and sample candidate solutions, the simulation uses a probability vector [7, 76, 109]. A
probability vector $p$ for $n$-bit binary strings has $n$ components, $p = (p_1, p_2, \ldots, p_n)$. The component
$p_i$ represents the probability of observing a 1 in position $i$ of a solution string. To learn the
probability vector, $p_i$ is set to the proportion of 1s in position $i$ observed in the selected set of
solutions. To sample a new candidate solution $(X_1, X_2, \ldots, X_n)$, the components of the probability
vector are polled and each $X_i$ is set to 1 with probability $p_i$, and to 0 with probability $1 - p_i$.

The expected outcome of the learning and sampling of the probability vector is that the population
of selected solutions and the population of new candidate solutions have the same proportion
of 1s in each position. However, since the sampling generates each new candidate solution
independently of the others, the actual proportions may deviate slightly from their expected values. The
probability-vector EDA described above is typically referred to as the univariate marginal
distribution algorithm (UMDA) [108]; other EDAs [7, 62, 76] based on the probability-vector model
are discussed in Section 4.1.1.

To keep the simulation simple, we consider a 5-bit onemax, a population of size N = 6, and 
truncation selection with threshold r = 50%. Recall that the truncation selection with r = 50% 
selects the top half of the current population. 

Figure 3 shows the first two iterations of the EDA simulation. The initial population of candidate
solutions is generated at random. Truncation selection then selects the best 50% of candidate 
solutions based on their evaluation using onemax to form the set of promising solutions. Next, 
the probability vector is created based on the selected solutions and the distribution encoded by 
the probability vector is sampled to generate new candidate solutions. The resulting population 
replaces the original population and the procedure repeats. 
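
The simulated procedure can also be written as a short program. The sketch below follows the setting of the simulation (5-bit onemax, a population of 6, truncation selection of the top half, full replacement), but the function names and the seed are assumptions, and the random numbers it draws will not reproduce the particular strings of the hand simulation.

```python
import random


def onemax(bits):
    """Objective function: the number of 1s in the string."""
    return sum(bits)


def learn_probability_vector(selected):
    """p[i] is the proportion of 1s in position i of the selected solutions."""
    n = len(selected[0])
    return [sum(s[i] for s in selected) / len(selected) for i in range(n)]


def sample_solution(p, rng):
    """Set each bit to 1 with probability p[i], independently of the others."""
    return [1 if rng.random() < p_i else 0 for p_i in p]


def umda(n=5, pop_size=6, iterations=2, seed=0):
    """Run the probability-vector EDA and record (model, new population) pairs."""
    rng = random.Random(seed)
    # Initial population drawn uniformly at random.
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    history = []
    for _ in range(iterations):
        # Truncation selection with threshold 50%: keep the better half.
        ranked = sorted(population, key=onemax, reverse=True)
        selected = ranked[: pop_size // 2]
        p = learn_probability_vector(selected)
        # Full replacement: the sampled solutions become the next population.
        population = [sample_solution(p, rng) for _ in range(pop_size)]
        history.append((p, population))
    return history


history = umda()
```

Note that once some p[i] reaches 0 or 1, every sampled string carries the same bit in position i; with a population as small as six, this kind of premature fixation is quite likely, which is one reason practical runs use much larger populations.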

In both iterations of the simulation, the average objective-function value in the new population
is greater than the average value in the population before selection. The increase in the average
quality of the population is good news for us because we want to maximize the objective function,
but why does this happen? Since for onemax the solutions with more 1s are better than those with
fewer 1s, selection should increase the number of 1s in the population. The learning and sampling
of the probability vector is not expected to create or destroy any bits and that is why the new
population of candidate solutions should contain more 1s than the original population (both in the
proportion and in the actual number). Since the onemax value increases with the number of 1s, we can
expect the overall quality of the population to increase over time. Ideally, every iteration should
increase the objective-function values in the population unless no improvement is possible.

Figure 3: Simple simulation of an EDA based on the probability-vector model for onemax. The
fitness values of candidate solutions are shown inside parentheses.

Nonetheless, the increase of the average objective-function value tells only half the story. A 
similar increase in the quality of the population in the first iteration would be achieved by just 
repeating selection alone without the use of the probabilistic model. However, by applying selection 
alone, no new solutions are ever created and the resulting algorithm produces no variation at 
all. Since the initial population is generated at random, the EDA with selection alone would be 
just a poor algorithm for obtaining the best solution from the initial population. The learning 
and sampling of the probabilistic model provides a mechanism for both (1) improving quality of 
new candidate solutions (under certain assumptions), and (2) facilitating exploration of the set of 
admissible solutions. 

What we have seen in this simulation was an example of the simplest kind of EDAs. The 
assumed class of probabilistic model, the probability vector, has a fixed structure. Under these 
circumstances, the procedure for learning it becomes trivial because there are really no alternative 
models to choose from. This class of EDAs is quite limited in what it can do. As we shall see in a 
moment, there are other classes of EDAs that allow richer probabilistic models capable of capturing 
interactions among the variables of a given problem. More importantly, these interactions can be
learned automatically on a problem-by-problem basis. This of course results in a more complex
model-building procedure, but the extra effort has been shown to be well worth it, especially when
solving more difficult optimization problems.



3 Taxonomy of EDA Models 

This section provides a high-level overview of the distinguishing characteristics of probabilistic 
models. The characteristics are discussed with respect to (1) the types of interactions covered by the 
model and (2) the types of local distributions. This section only focuses on the key characteristics of 
the probabilistic models; a more detailed overview of EDAs for various representations of candidate
solutions is given in the following sections.

3.1 Classification Based on Problem Decomposition 

To make the estimation and sampling tractable with reasonable sample sizes, most EDAs use 
probabilistic models that decompose the problem using unconditional or conditional independence. 
The way in which a model decomposes the problem provides one important characteristic that 
distinguishes different classes of probabilistic models. Classification of probabilistic models based on 
the way they decompose a problem is relevant regardless of the types of the underlying distributions 
or the representation of problem variables. 

Most EDAs assume that candidate solutions are represented by fixed-length vectors of variables 
and they use graphical models to represent the underlying problem structure. Graphical models 
allow practitioners to represent both direct dependencies between problem variables and
independence assumptions. One way to classify graphical models is to consider a hierarchy of model
types based on the complexity of a model (see Figure 4 for illustrative examples) [66, 84, 124]:

1. No dependencies. In models that assume full independence, every variable is assumed to be
independent of any other variable. That is, the probability distribution $P(X_1, X_2, \ldots, X_n)$ of
the vector $(X_1, X_2, \ldots, X_n)$ of $n$ variables is assumed to consist of a product of the distributions
of individual variables:

\[
P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i). \tag{2}
\]

The simulation presented in Section 2.3 was based on a model that assumed full independence
of binary problem variables. EDAs based on univariate models that assume full independence of
problem variables include the equilibrium genetic algorithm (EGA) [76], the population-based
incremental learning (PBIL) [7], the univariate marginal distribution algorithm (UMDA) [108],
the compact genetic algorithm [62], the stochastic hill climbing with learning by vectors of
normal distributions [151], and the continuous PBIL [173].

2. Pairwise dependencies. In this class of models, dependencies between variables form a tree
or forest graph. In a tree graph, each variable except for the root of the tree is conditioned on
its parent in a tree that contains all variables. A forest graph, on the other hand, is a collection
of disconnected trees. Again, the forest contains all problem variables. Denoting by $R$ the set
of roots of the trees in a forest, and by $X = (X_1, X_2, \ldots, X_n)$ the entire vector of variables, the
distribution from this class can be expressed as:

\[
P(X_1, X_2, \ldots, X_n) = \prod_{X_i \in R} P(X_i) \prod_{X_i \in X \setminus R} P(X_i \mid \mathrm{parent}(X_i)). \tag{3}
\]

A special type of a tree model is sometimes distinguished, in which the variables form a sequence
(or a chain), and each variable except for the first one depends directly on its predecessor.
Denoting by $\pi(i)$ the index of the $i$th variable in the sequence, the distribution is given by

\[
P(X_1, X_2, \ldots, X_n) = P(X_{\pi(1)}) \prod_{i=2}^{n} P(X_{\pi(i)} \mid X_{\pi(i-1)}). \tag{4}
\]

EDAs based on models with pairwise dependencies include the mutual information maximizing
input clustering (MIMIC) [34], EDA based on dependency trees [8], and the bivariate marginal
distribution algorithm (BMDA) [128].

3. Multivariate dependencies. Multivariate models represent dependencies using either directed 
acyclic graphs or undirected graphs. Two representative models are popular in EDAs: (1) 
Bayesian networks and (2) Markov networks. A Bayesian network is represented by a directed 
acyclic graph where each node corresponds to a variable and each edge defines a direct conditional 
dependence. The probability distribution encoded by a Bayesian network can be written as

\[
P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i)). \tag{5}
\]

A Bayesian network represents problem decomposition by conditional independence assump- 
tions; each variable is assumed to be independent of any of its antecedents in the ancestral or- 
dering of the variables, given the values of the variable's parents. Note that all models discussed 
thus far were special cases of Bayesian networks. In fact, a Bayesian network can represent an 
arbitrary multivariate distribution; however, for such a model to be practical, it is often desirable 
to consider Bayesian networks of limited complexity. 

In Markov networks (Markov random field models), two variables are assumed to be independent 
of each other given a subset of variables defining the condition if every path between these 
variables is separated by one or more variables in the condition. 

A special subclass of multivariate models is sometimes considered in which the variables are 
divided into disjoint clusters, which are independent of each other. These models are called 
marginal product models. Polytrees also represent a subclass of multivariate models in which a 
directed acyclic graph is used as the basic dependency structure but the graph is restricted so 
that at most one undirected path exists between any two vertices. 

EDAs based on models with multivariate dependencies include the factorized distribution
algorithm (FDA) [105], the learning FDA (LFDA) [105], the estimation of Bayesian network
algorithm (EBNA) [39], the Bayesian optimization algorithm (BOA) [122, 123] and its hierarchical
version (hBOA) [120], the extended compact genetic algorithm (ecGA) [60], the polytree
EDA [184], the continuous iterated density estimation algorithm [21], the estimation of multivariate
normal algorithm (EMNA) [83], and the real-coded BOA (rBOA) [2].

4. Full dependence. Models may be used that do not make any independence assumptions. 
However, such models must typically impose a number of other restrictions on the distribution 
to ensure that the models remain tractable for a moderate-to-large number of variables. 
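
Since the univariate, tree, and chain factorizations are all special cases of the Bayesian-network factorization in Equation (5), a single sampling routine covers them. The Python sketch below draws a solution by ancestral sampling (parents before children) and evaluates the product of local conditional probabilities for binary variables. The data layout, the function names, and the three-variable chain with its probabilities are hypothetical choices made for illustration.

```python
import random
from itertools import product


def sample_bayesian_network(order, parents, cpt, rng):
    """Ancestral sampling for binary variables.

    order   -- variables listed so that parents precede their children
    parents -- parents[v] is a tuple of the parents of v (empty for roots)
    cpt     -- cpt[v][parent_values] is P(v = 1 | parents = parent_values)
    """
    values = {}
    for v in order:
        parent_values = tuple(values[u] for u in parents[v])
        values[v] = 1 if rng.random() < cpt[v][parent_values] else 0
    return values


def joint_probability(values, parents, cpt):
    """Product of the local conditional probabilities of a full assignment."""
    prob = 1.0
    for v, x in values.items():
        p_one = cpt[v][tuple(values[u] for u in parents[v])]
        prob *= p_one if x == 1 else 1.0 - p_one
    return prob


# Hypothetical chain X1 -> X2 -> X3 with made-up probabilities.
order = ["X1", "X2", "X3"]
parents = {"X1": (), "X2": ("X1",), "X3": ("X2",)}
cpt = {
    "X1": {(): 0.5},
    "X2": {(0,): 0.1, (1,): 0.9},
    "X3": {(0,): 0.2, (1,): 0.8},
}

sample = sample_bayesian_network(order, parents, cpt, random.Random(1))

# The factorization defines a proper distribution: probabilities sum to 1.
total = sum(
    joint_probability(dict(zip(order, bits)), parents, cpt)
    for bits in product((0, 1), repeat=3)
)
```

Setting every parent tuple to the empty tuple recovers the univariate model of Equation (2), while allowing at most one parent per variable gives the tree and forest models of Equation (3).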

There are two additional types of probabilistic models that have been used in EDAs and that 
provide a somewhat different mechanism for decomposing the problem: 

1. Grammar models. Some EDAs use stochastic or deterministic grammars to represent the
probability distribution over candidate solutions. The advantage of grammars is that they
allow modeling of variable-length structures. Because of this, grammar distributions are mostly
used as the basis for implementing genetic programming using EDAs [99], which represents
candidate solutions using labeled trees of variable sizes. Grammar models are used for example
in the probabilistic-grammar based EDA for genetic programming [13], the program distribution
estimation with grammar model (PRODIGY) [180] or the EDA based on probabilistic grammars
with latent annotations [63].

2. Feature-based models. Feature-based models encode the distribution of the neighborhood of
a candidate solution using position-independent substructures, which can be found in a variety of
positions in fixed-length or variable-length solutions. This approach is used in the feature-based
Bayesian optimization algorithm [94]. Other features may be discovered, encoded, and used
for guiding the exploration of the space of candidate solutions. Model-directed neighborhood
structures are also used in other EDA variants, as will be discussed in Section [

(a) Univariate model  (b) Chain model  (c) Forest model  (d) Marginal product model  (e) Bayesian network  (f) Markov network

Figure 4: Illustrative examples of graphical models. Problem variables are displayed as circles,
dependencies are shown as edges between variables or clusters of variables.



3.2 Classification Based on Local Distributions in Graphical Models 

Regardless of how a graphical model decomposes the problem, each model must also assume one 
or more classes of distributions to encode local conditional and marginal distributions. Some of the 
most common classes of local distributions are discussed below: 

1. Probability tables. For discrete representations, conditional and marginal probabilities can
be encoded using probability tables, which define a probability for each relevant combination of
values in each conditional or marginal probability term. This was the case for example in the
simulation in Section 2.3, in which the probability distribution for each string position $i$ was
represented by the probability $p_i$ of a 1; the probability of a 0 in the same position was simply
$1 - p_i$. As another example, in Bayesian networks, for each variable a probability table can be
used to define conditional probabilities of any value of the variable given any combination of
values of the variable's parents. While probability tables cannot directly represent continuous
probability distributions, they can be used even for real-valued representations in combination
with a discretization method that maps real-valued variables into discrete categories; each of
the discrete categories can then be represented using a single probability entry. Probability
tables are used for example in UMDA [108], BOA [123] and ecGA [60]. An example conditional
probability table is shown in Figure 5.

2. Decision trees or graphs, default tables. To avoid excessively large probability tables 
when many probabilities are either similar or negligible, more advanced local structures such 
as decision trees, decision graphs or default tables may be used. In decision trees, for example, 
probabilities are stored in leaves of a decision tree in which each internal node represents a test 
on a variable and the children of the node correspond to the different outcomes of the test. 
Decision trees and decision graphs can also be used in combination with real-valued variables, in 
which the leaves store a continuous distribution in some way. More advanced structures such as 
decision trees and decision graphs are used for example in the decision-graph BOA (dBOA) [125],
the hierarchical BOA (hBOA) [120], and the mixed BOA [110]. An example decision tree for
representing conditional probabilities is shown in Figure 5.

3. Multivariate, continuous distributions. The normal distribution is by far the most popular 
distribution used in EDAs to represent univariate or multivariate distributions of real-valued 
variables. A multivariate normal distribution can encode a linear correlation between the variables
using the covariance matrix, but it is often inefficient in representing many other types
of interactions [15, 110]. Normal distributions were used in many EDAs for real-valued vectors
[151, 173, 21, 83], although in many real-valued EDAs more advanced distributions were
used as well. Examples of multivariate normal distributions are shown in Figure 6(a)-(c).

4. Mixtures of distributions. A mixture distribution consists of multiple components. Each 
component is represented by a specific local probabilistic model, such as a normal distribution, 
and each component is assigned a probability. Mixture distributions were used in EDAs especially
to enable EDAs for real-valued representations to deal with real-valued distributions with
multiple basins of attraction, in which a single-peak distribution does not suffice. Mixture
distributions were used for example in the real-valued iterated density estimation algorithms [21]
or the real-coded BOA [2]. The use of mixture distributions is more popular in EDAs for real-valued
representations, although mixture distributions were also used to represent distributions
over discrete representations in which the population consists of multiple dissimilar clusters [119]
and in multiobjective EDAs [189, 132]. An example of a mixture of normal kernel distributions
is shown in Figure 6(d).

5. Histograms. In a number of EDAs for real-valued representations, to encode local distributions,
real-valued variables or sets of such variables are divided into rectangular regions using
a histogram-like model, and a probability or a single probabilistic model is used to represent
the distribution in each region. Histogram models can be seen as a special subclass of the
decision-tree models for real-valued variables. In real-valued EDAs, histograms were used for
example in the histogram-based continuous EDA [194]. Histogram models can also be used for
other representations; for example, when optimizing permutations, histograms can be used to
represent different relative ordering constraints and their importance with respect to solution
quality [192, 193].

Figure 5: A conditional probability table for $p(X_1 \mid X_2, X_3, X_4)$ and a corresponding decision tree
that reduces the number of parameters (probabilities) from 8 to 4.
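
The saving offered by a decision tree over a full probability table can be made concrete with a small Python sketch. The probabilities and the tree shape below are hypothetical and are not taken from Figure 5; they merely reproduce the same kind of reduction, from 8 table entries to 4 leaves, for a binary variable with three binary parents.

```python
from itertools import product

# Full table: one entry P(X1 = 1 | x2, x3, x4) per parent combination.
# Keys are (x2, x3, x4); all numbers are made up for illustration.
table = {
    (0, 0, 0): 0.75, (0, 0, 1): 0.75,
    (0, 1, 0): 0.25, (0, 1, 1): 0.25,
    (1, 0, 0): 0.20, (1, 0, 1): 0.80,
    (1, 1, 0): 0.20, (1, 1, 1): 0.80,
}

# Decision tree encoding the same distribution. An internal node is a tuple
# (parent_index, subtree_if_0, subtree_if_1); a leaf is a probability.
# Root tests x2; if x2 = 0 only x3 matters, if x2 = 1 only x4 matters.
tree = (0, (1, 0.75, 0.25), (2, 0.20, 0.80))


def tree_lookup(node, parent_values):
    """Follow the tests from the root down to a leaf probability."""
    while not isinstance(node, float):
        index, if_zero, if_one = node
        node = if_one if parent_values[index] else if_zero
    return node


def count_leaves(node):
    """Number of parameters stored by the tree."""
    if isinstance(node, float):
        return 1
    return count_leaves(node[1]) + count_leaves(node[2])


tree_matches_table = all(
    tree_lookup(tree, pv) == table[pv] for pv in product((0, 1), repeat=3)
)
```

The tree is smaller only because several table rows share a probability; for a table whose 8 entries are all different, a decision tree would need 8 leaves as well.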

4 Overview of EDAs

This section gives an overview of EDAs based on the representation of candidate solutions, although
some of the EDAs can be used across several representations. Due to the large volume of work in
EDAs in the past two decades, we do not aim to list every single variant of an EDA discussed in
the past; instead, we focus on some of the most important representatives.

4.1 EDAs for Fixed-Length Strings over Finite Alphabets

EDAs for candidate solutions represented by fixed-length strings over a finite alphabet can use a
variety of model types, from simple univariate models to complex Bayesian networks with local
structures. This section reviews some of the work in this area. Candidate solutions are assumed to
be represented by binary strings of fixed length n, although most methods presented here can be
extended to optimization of strings over an arbitrary finite alphabet. The section classifies EDAs
based on the order of interactions in the underlying dependency model along the lines discussed in 
Section O US E3 H23] • 

4.1.1 No interactions 

The equilibrium genetic algorithm (EGA) |7Bj and the population-based incremental learning
(PBIL) [7] replace the population of candidate solutions represented as fixed-length binary strings
by a probability vector $(p_1, p_2, \ldots, p_n)$, where $n$ is the number of bits in a string and $p_i$ denotes
the probability of a 1 in the $i$th position of solution strings. Each $p_i$ is initially set to 0.5, which
corresponds to a uniform distribution over the set of all solutions. In each iteration, PBIL generates
$s$ candidate solutions according to the current probability vector, where $s > 2$ denotes the selection
pressure. Each value is generated independently of its context (remaining bits) and thus no interactions
are considered (see Figure 4(a)). The best solution from the generated set of $s$ solutions is
then used to update the probability-vector entries using

$$p_i = p_i + \lambda (x_i - p_i),$$

where $\lambda \in (0, 1)$ is the learning rate (say, 0.02), and $x_i$ is the $i$th bit of the best solution. Using the
above update rule, the probability $p_i$ of a 1 in the $i$th position increases if the best solution contains
a 1 in that position and decreases otherwise. In other words, probability-vector entries move
toward the best solution and, consequently, the probability of generating this solution increases.
The process of generating new solutions and updating the probability vector is repeated until some
termination criteria are met; for instance, the run can be terminated if all probability-vector entries
are sufficiently close to either 0 or 1.

Figure 6: Local models for continuous distributions over real-valued variables. (a) Multivariate
normal distribution with equal standard deviations and no covariance. (b) Multivariate normal
distribution with arbitrary standard deviations for each variable (diagonal covariance matrix).
(c) Multivariate normal distribution with an arbitrary (non-diagonal) covariance matrix. (d) Joint
normal kernels distribution.

Prior work also refers to PBIL as hill climbing with learning (HCwL) [82] and the incremental
univariate marginal distribution algorithm (IUMDA) [102].

PBIL is an incremental EDA, because it proceeds by executing incremental updates of the
model using a small sample of candidate solutions. However, there is a strong correlation between
the learning rate in PBIL and the population size in population-based EDAs or other evolutionary
algorithms; essentially, decreasing the learning rate $\lambda$ corresponds to increasing the population size.
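
The PBIL loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the default of 10 samples per iteration, the convergence threshold `eps`, and the iteration cap are all assumptions made for the example.

```python
import random

def pbil_update(p, best, learning_rate):
    """Shift every probability-vector entry toward the bit of the best solution."""
    return [pi + learning_rate * (xi - pi) for pi, xi in zip(p, best)]

def pbil(fitness, n, num_samples=10, learning_rate=0.02,
         max_iters=5000, eps=0.05, seed=0):
    """Maximize `fitness` over binary strings of length n (illustrative defaults)."""
    rng = random.Random(seed)
    p = [0.5] * n  # uniform distribution over all binary strings
    for _ in range(max_iters):
        # Every bit is sampled independently of the others.
        samples = [[1 if rng.random() < pi else 0 for pi in p]
                   for _ in range(num_samples)]
        best = max(samples, key=fitness)
        p = pbil_update(p, best, learning_rate)
        # Terminate once every entry is close to either 0 or 1.
        if all(pi < eps or pi > 1.0 - eps for pi in p):
            break
    return p
```

For instance, `pbil(sum, n=10)` runs the sketch on the onemax function (the number of ones in the string).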

The compact genetic algorithm (cGA) [62, 59] reduces the gap between PBIL and traditional
steady-state genetic algorithms. Like PBIL, cGA replaces the population by a probability vector
and all entries in the probability vector are initialized to 0.5. Each iteration updates the probability
vector by mimicking the effect of a single competition between two sampled solutions, where the
best replaces the worst, on a hypothetical population of size $N$. Denoting the bit in the $i$th position
of the best and worst of the two sampled solutions by $x_i$ and $y_i$, respectively, the probability-vector
entries are updated as follows:

$$
p_i = \begin{cases}
p_i + \frac{1}{N} & \text{if } x_i = 1 \text{ and } y_i = 0, \\
p_i - \frac{1}{N} & \text{if } x_i = 0 \text{ and } y_i = 1, \\
p_i & \text{otherwise.}
\end{cases}
$$
Although cGA uses a probability vector instead of a population, updates of the probability vector
correspond to replacing one candidate solution by another one using a population of size $N$
and shuffling the resulting population using a univariate model that assumes full independence of
problem variables.
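
A minimal sketch of the cGA update and main loop follows. The clamping of entries to [0, 1], the convergence tolerance, the iteration cap, and all names are assumptions of this example and are not taken from the cited work.

```python
import random

def cga_update(p, winner, loser, N):
    """Apply one cGA step for a hypothetical population of size N."""
    step = 1.0 / N
    new_p = []
    for pi, xi, yi in zip(p, winner, loser):
        if xi == 1 and yi == 0:
            pi = min(1.0, pi + step)   # clamping is an assumption of this sketch
        elif xi == 0 and yi == 1:
            pi = max(0.0, pi - step)
        new_p.append(pi)               # unchanged when both bits agree
    return new_p

def cga(fitness, n, N=50, max_iters=20000, tol=1e-9, seed=0):
    rng = random.Random(seed)
    p = [0.5] * n
    for _ in range(max_iters):
        a = [1 if rng.random() < pi else 0 for pi in p]
        b = [1 if rng.random() < pi else 0 for pi in p]
        winner, loser = (a, b) if fitness(a) >= fitness(b) else (b, a)
        p = cga_update(p, winner, loser, N)
        # Stop when every entry has (numerically) reached 0 or 1.
        if all(pi < tol or pi > 1.0 - tol for pi in p):
            break
    return p
```

A larger `N` makes each step smaller, which mirrors the role of a larger population in a steady-state genetic algorithm.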

The univariate marginal distribution algorithm (UMDA) [109] maintains a population of solutions.
Each iteration of UMDA starts by selecting a population of promising solutions using
an arbitrary selection method of evolutionary algorithms. A probability vector is then computed 
using the selected population of promising solutions and new solutions are generated by sampling 
the probability vector. The new solutions replace the old ones and the process is repeated until 
termination criteria are met. Although UMDA uses a probabilistic model as an intermediate step 
between the original and new populations unlike PBIL and cGA, the performance, dynamics and 
limitations of PBIL, cGA, and UMDA are similar. 
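
The UMDA iteration can be sketched as follows. Truncation selection is used here only because it is short to write; the text allows an arbitrary selection method, and the population sizes, generation count, and names are illustrative assumptions.

```python
import random

def marginals(selected):
    """Probability vector: frequency of a 1 at each position in the selected set."""
    m = len(selected)
    return [sum(x[i] for x in selected) / m for i in range(len(selected[0]))]

def umda(fitness, n, pop_size=100, num_selected=50, max_gens=50, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(max_gens):
        # Truncation selection (an assumption; any selection method would do).
        selected = sorted(pop, key=fitness, reverse=True)[:num_selected]
        p = marginals(selected)
        # New solutions replace the old population entirely.
        pop = [[1 if rng.random() < pi else 0 for pi in p]
               for _ in range(pop_size)]
    return max(pop, key=fitness)
```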

PBIL, cGA, and UMDA can solve problems decomposable into subproblems of order one in 
a linear or quadratic number of fitness evaluations. However, if decomposition into single-bit 
subproblems misleads the decision making away from the optimum, these algorithms scale up 
poorly with problem size [132, 187, 188].

4.1.2 Pairwise interactions 

EDAs based on pairwise probabilistic models, such as a chain, a tree or a forest, represent the first
step toward EDAs being capable of learning variable interactions and therefore solving decomposable
problems of bounded order (difficulty) in a scalable manner.

The mutual-information-maximizing input clustering (MIMIC) algorithm [34] uses a chain distribution
(see Figure 4(b)) specified by (1) an ordering of string positions (variables), (2) a probability
of a 1 in the first position of the chain, and (3) conditional probabilities of every other
position given the value in the previous position in the chain. A chain probabilistic model encodes
the probability distribution where all positions except the first are conditionally dependent on the
previous position in the chain. After selecting promising solutions and computing marginal and
conditional probabilities, MIMIC uses a greedy algorithm to maximize mutual information between
the adjacent positions in the chain. In this fashion the Kullback-Leibler divergence [81] between
the chain and actual distributions is minimized. Nonetheless, the greedy algorithm does not guarantee
global optimality of the constructed model (with respect to Kullback-Leibler divergence).
The greedy algorithm starts in the position with the minimum unconditional entropy. The chain 
is expanded by adding a new position that minimizes the conditional entropy of the new variable 
given the last variable in the chain. Once the full chain is constructed for the selected population 
of promising solutions, new solutions are generated by sampling the distribution encoded by the 
chain. 
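
The greedy chain construction described above can be sketched as follows, using empirical entropies computed from the selected solutions. The function names are hypothetical, ties are broken by the lowest variable index (an arbitrary choice), and sampling from the finished chain is omitted.

```python
import math
from collections import Counter

def entropy(data, i):
    """Empirical entropy H(X_i) in bits."""
    m = len(data)
    counts = Counter(x[i] for x in data)
    return -sum(c / m * math.log2(c / m) for c in counts.values())

def cond_entropy(data, i, j):
    """Empirical conditional entropy H(X_i | X_j) = H(X_i, X_j) - H(X_j)."""
    m = len(data)
    joint = Counter((x[i], x[j]) for x in data)
    h_joint = -sum(c / m * math.log2(c / m) for c in joint.values())
    return h_joint - entropy(data, j)

def mimic_chain(data):
    """Return an ordering of variables built greedily as described in the text."""
    remaining = list(range(len(data[0])))
    # Start with the variable of minimum unconditional entropy.
    first = min(remaining, key=lambda i: entropy(data, i))
    chain = [first]
    remaining.remove(first)
    while remaining:
        last = chain[-1]
        # Append the variable with minimum conditional entropy given the last one.
        nxt = min(remaining, key=lambda i: cond_entropy(data, i, last))
        chain.append(nxt)
        remaining.remove(nxt)
    return chain
```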

There are two important drawbacks of using chain distributions. The first drawback is that chain 
distributions limit the expressiveness of probabilistic models by restricting dependencies between 
string positions that can be encoded. Despite that, chain distributions can encode dependencies 
between pairs of positions that can be located anywhere along the solution strings; these dependencies
are not preserved by the univariate model-based EDAs. The second drawback is that there
is no known algorithm for learning the best chain distribution in polynomial time. Despite these
disadvantages, the use of pairwise interactions was one of the most important steps in the development
of EDAs capable of solving decomposable problems of bounded difficulty scalably. MIMIC
was the first discrete EDA that not only learned and used a fixed set of statistics but was also capable
of identifying the statistics that should be considered to solve the problem efficiently.

Baluja and Davies [8] use dependency trees (see Figure 4(b)) to model promising solutions. As
in PBIL, the population is replaced by a probability vector, but in this case the probability vector
contains all pairwise probabilities. The probabilities are initialized to 0.25. Each iteration adjusts
the probability vector according to new promising solutions acquired on the fly. A dependency
tree encodes the probability distribution where every variable except for the root is conditioned on
the variable's parent in the tree. A variant of Prim's algorithm for finding the minimum spanning
tree [138] can be used to construct an optimal tree distribution. Here the task is to find a tree
that maximizes mutual information between parents (nodes with successors) and their children
(successors). This can be done by first randomly choosing a variable to form the root of the tree,
and "hanging" new variables to the existing tree so that the mutual information between the parent
of the new variable and the variable itself is maximized. In this way, the Kullback-Leibler divergence
between the tree and actual distributions is minimized as shown in ref. [32]. Once a full tree is
constructed, new solutions are generated according to the distribution encoded by the constructed 
dependency tree and the conditional probabilities computed from the probability vector. 
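
The tree construction can be illustrated with a Prim-style sketch that estimates mutual information directly from a set of promising solutions. This is a simplification: it does not maintain the incrementally updated vector of pairwise probabilities used by Baluja and Davies, the root is passed in rather than chosen at random, and all names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(data, i, j):
    """Empirical mutual information I(X_i; X_j) in bits."""
    m = len(data)
    ci = Counter(x[i] for x in data)
    cj = Counter(x[j] for x in data)
    cij = Counter((x[i], x[j]) for x in data)
    return sum(c / m * math.log2(c * m / (ci[a] * cj[b]))
               for (a, b), c in cij.items())

def dependency_tree(data, root=0):
    """Return {variable: parent}, with the root mapped to None."""
    n = len(data[0])
    parent = {root: None}
    while len(parent) < n:
        # "Hang" the outside variable that has the highest mutual information
        # with some variable already in the tree.
        _, u, v = max((mutual_information(data, u, v), u, v)
                      for u in parent for v in range(n) if v not in parent)
        parent[v] = u
    return parent
```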

The bivariate marginal distribution algorithm (BMDA) [128] uses a forest distribution (a set
of mutually independent dependency trees, see Figure 4(b)). This class of models is even more
general than the class of dependency trees, because any forest that contains two or more disjoint
trees cannot be generally represented by a tree. As a measure used to determine whether to connect
two variables, BMDA uses Pearson's chi-square test [98]. This measure is also used to discriminate
the remaining dependencies in order to construct the final model. To learn a model, BMDA uses
a variant of Prim's algorithm [138].

Pairwise models capture some interactions in a problem with reasonable computational overhead.
EDAs with pairwise probabilistic models can identify, propagate, and juxtapose partial
solutions of order two, and therefore they work well on problems decomposable into subproblems
of order at most two [34, 8, 102, 128, 20]. Nonetheless, capturing only some pairwise interactions
has still been shown to be insufficient for solving all decomposable problems of bounded difficulty
scalably [128, 20].

4.1.3 Multivariate interactions 

Using general multivariate models allows powerful EDAs capable of solving problems of bounded 
difficulty quickly, accurately, and reliably [84, 97, 105, 117, 130]. On the other hand, learning
distributions with multivariate interactions necessitates more complex model-learning algorithms 
that require significant computational time and still do not guarantee global optimality of the 
resulting model. Nonetheless, many difficult problems are intractable using simple models and the 
use of complex models and algorithms is necessary. 

The factorized distribution algorithm (FDA) [107] uses a fixed factorized distribution throughout
the whole run. The model is allowed to contain multivariate marginal and conditional probabilities,
but FDA learns only the probabilities, not the structure (dependencies and independencies).
To solve a problem using FDA, we must first decompose the problem and then factorize the decomposition.
While it is useful to incorporate prior information about the regularities in the search
space, FDA necessitates that the practitioner is able to decompose the problem using a probabilistic
model ahead of time. FDA does not learn what statistics are important to process within the EDA
framework; it must be given that information in advance. A variant of FDA where probabilistic
models are restricted to polytrees was also proposed [184].

The extended compact genetic algorithm (ecGA) [60] uses a marginal product model (MPM)
that partitions the variables into disjoint subsets (see Figure 4(d)). Each partition (subset) is
treated as a single variable and different partitions are considered to be mutually independent. To
decide between alternative MPMs, ecGA uses a variant of the minimum description length (MDL)
metric [146, 147, 148], which favors models that allow higher compression of data (in this case, the
selected set of promising solutions). More specifically, the Bayesian information criterion (BIC) [169]
is used. To find a good model, ecGA uses a greedy algorithm that starts with each variable forming
one partition (as in UMDA). Each iteration of the greedy algorithm merges two partitions that
maximize the improvement of the model with respect to BIC. If no more improvement is possible,
the current model is used. ecGA provides a robust and scalable solution for problems that can be
decomposed into independent subproblems of bounded order (separable problems) [163, 162, 165].
However, many real-world problems contain overlapping dependencies, which cannot be accurately
modeled by dividing the variables into disjoint partitions; this can result in poor performance of
ecGA.
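
The greedy partition-merging procedure can be sketched as below. The score is a BIC-style description length for binary variables (half of log2 of the sample size per free parameter, plus the compressed size of the data); the exact constants used in ecGA may differ, so the scoring function and all names should be read as assumptions of this example.

```python
import itertools
import math
from collections import Counter

def mpm_score(data, partition):
    """BIC-style description length in bits (lower is better)."""
    m = len(data)
    total = 0.0
    for group in partition:
        counts = Counter(tuple(x[i] for i in group) for x in data)
        # Model cost: 2^k - 1 free parameters for a group of k binary variables.
        total += (2 ** len(group) - 1) * 0.5 * math.log2(m)
        # Data cost: m times the empirical entropy of the group.
        total += -sum(c * math.log2(c / m) for c in counts.values())
    return total

def greedy_mpm(data):
    """Start from single-variable groups and merge while the score improves."""
    partition = [(i,) for i in range(len(data[0]))]
    score = mpm_score(data, partition)
    while len(partition) > 1:
        best = None
        for a, b in itertools.combinations(range(len(partition)), 2):
            merged = [g for k, g in enumerate(partition) if k not in (a, b)]
            merged.append(partition[a] + partition[b])
            s = mpm_score(data, merged)
            if best is None or s < best[0]:
                best = (s, merged)
        if best[0] >= score:
            break  # no merge improves the model
        score, partition = best
    return partition
```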

The dependency-structure matrix genetic algorithm (DSMGA) [201, 202, 200] uses a class of models
similar to that of ecGA, which splits the variables into independent clusters or linkage groups.
However, DSMGA builds models via dependency structure matrix clustering techniques.

The Bayesian optimization algorithm (BOA) [122] builds a Bayesian network for the population
of promising solutions (see Figure 4(e)) and samples the built network to generate new candidate
solutions. BOA uses the Bayesian-Dirichlet metric subject to a maximum model-complexity constraint
[33, 70, 71] to discriminate competing models, but other metrics (such as BIC) have been
analyzed in BOA as well. In all variants of BOA, the model is constructed by a greedy algorithm
that iteratively adds a new dependency in the model that maximizes the model quality. Other
elementary graph operators, such as edge removals and reversals, can be incorporated, but edge
additions are most important. The construction is terminated when no more improvement is possible.
The greedy algorithm used to learn a model in BOA is similar to the one used in ecGA.
However, Bayesian networks can encode more complex dependencies and independencies than models
used in ecGA. Therefore, BOA is also applicable to problems with overlapping dependencies.
BOA uses a class of models equivalent to that of FDA; however, BOA learns both the structure and the
probabilities of the model. Although BOA does not require problem-specific knowledge in advance, 
prior information about the problem can be incorporated using Bayesian statistics, and the rela- 
tive influence of prior information and the population of promising solutions can be tuned by the 
user [MlfTTT]. 
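
Once a network structure is available, whether learned as in BOA or given in advance as in FDA, estimating its probabilities and sampling it are straightforward. The sketch below assumes binary variables, a structure supplied as a dictionary of parent tuples, and a topological ordering of the variables; structure learning itself is not shown, and the fallback probability of 0.5 for unseen parent configurations is an assumption of this example.

```python
import random
from collections import Counter

def fit_cpts(data, parents):
    """Estimate P(X_v = 1 | parents of v) by counting in the selected solutions."""
    cpt = {}
    for v, pa in parents.items():
        ones, total = Counter(), Counter()
        for x in data:
            key = tuple(x[u] for u in pa)
            total[key] += 1
            ones[key] += x[v]
        cpt[v] = {key: ones[key] / total[key] for key in total}
    return cpt

def sample_network(order, parents, cpt, rng):
    """Forward (ancestral) sampling: parents are always sampled before children."""
    x = {}
    for v in order:
        key = tuple(x[u] for u in parents[v])
        prob_one = cpt[v].get(key, 0.5)  # unseen parent configuration: assume 0.5
        x[v] = 1 if rng.random() < prob_one else 0
    return [x[v] for v in sorted(x)]
```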

A discussion of the use of Bayesian networks as an extension to tree models can also be found 
in Baluja's work [9]. An EDA that uses Bayesian networks to model promising solutions was
independently developed by Etxeberria and Larrañaga [39], who called it the estimation of Bayesian
network algorithm (EBNA). Mühlenbein and Mahnig [105] improved the original FDA by using
Bayesian networks together with the greedy algorithm for learning the networks described above;
the modification of FDA was named the learning factorized distribution algorithm (LFDA). An
incremental version of BOA was proposed by Pelikan et al. [133].

The hierarchical BOA (hBOA) [120] extends BOA by employing local structures to represent
local distributions instead of using standard conditional probability tables. This enables hBOA to
represent distributions with high-order interactions. Furthermore, hBOA incorporates a niching
technique called restricted tournament selection to ensure effective diversity preservation. The two
extensions enable hBOA to solve problems decomposable into subproblems of bounded order over
a number of levels of difficulty of a hierarchy [196, 120].

Markov networks are yet another class of models that can be used to identify and use multivariate
interactions in EDAs. Markov networks are undirected graphical models (see Figure 4(f)).
Compared to Bayesian networks, Markov networks may sometimes cover the same distribution
using fewer edges in the dependency model, but the sampling of these models becomes more complicated
than the sampling of Bayesian networks. Markov networks are used for example in the
Markov network EDA (MN-EDA) [156] and the density estimation using Markov random fields
algorithm (DEUM) [176, 177].

Helmholtz machines used in the Bayesian evolutionary algorithm proposed by Zhang and
Shin [204] can also encode multivariate interactions. Helmholtz machines encode interactions by
introducing new, hidden variables, which are connected to every variable. 

EDAs that use models capable of covering multivariate interactions can solve a wide range 
of problems in a scalable manner; promising results were reported on a broad range of problems, 
including several classes of spin-glass systems [111, 121, 127, 178], graph partitioning [106, 170, 171],
telecommunication network optimization [150], silicon cluster optimization [162], scheduling [88],
forest management [36], ground water remediation system design [4, 69], and others.

4.2 EDAs for Real-Valued Vectors

There are two basic approaches to extending EDAs for discrete fixed-length strings to other domains 
such as real-valued vectors:

1. Map the other representation to the domain of fixed-length discrete strings, solve the discrete 
problem, and map the solution back to the problem's original representation. 

2. Extend or modify the class of probabilistic models to other domains. 

A number of studies have been published about the mapping of real-valued representations into a
discrete one in evolutionary computation [28, 30, 43, 48, 135]; this section focuses on EDAs from
the second category. The approaches are classified along the lines presented in Section 3 [117, 124].

4.2.1 Single-peak normal distributions 

The stochastic hill climbing with learning by vectors of normal distributions (SHCLVND) [151] is
a straightforward extension of PBIL to vectors of real-valued variables using a normal distribution
to model each variable. SHCLVND replaces the population of real-valued solutions by a vector
of means $\mu = (\mu_1, \ldots, \mu_n)$, where $\mu_i$ denotes a mean of the distribution for the $i$th variable. The
same standard deviation $\sigma$ is used for all variables. See Figure 6(a) for an example model. At each
generation, a random set of solutions is first generated according to $\mu$ and $\sigma$. The best solution out
of this subset is then used to update the entries in $\mu$ by shifting each $\mu_i$ toward the value of the $i$th
variable in the best solution using an update rule similar to the one used in PBIL. Additionally,
each generation reduces the standard deviation to make the future exploration of the search space
narrower. A similar algorithm was independently developed by Sebag and Ducoulombier [173], who
also discussed several approaches to evolving a standard deviation for each variable.
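
The following sketch mirrors the description above: a vector of means, one shared standard deviation, a PBIL-like shift toward the best sample, and a geometric reduction of the standard deviation. The initial means of zero, the decay factor, and the other defaults are assumptions chosen for illustration only.

```python
import random

def shclvnd(fitness, n, num_samples=20, learning_rate=0.1,
            sigma=1.0, sigma_decay=0.99, iters=200, seed=0):
    """Maximize `fitness` over real-valued vectors of length n."""
    rng = random.Random(seed)
    mu = [0.0] * n  # starting point is an assumption of this sketch
    for _ in range(iters):
        samples = [[rng.gauss(m, sigma) for m in mu] for _ in range(num_samples)]
        best = max(samples, key=fitness)
        # PBIL-like update: shift each mean toward the best sample.
        mu = [m + learning_rate * (x - m) for m, x in zip(mu, best)]
        # Narrow future exploration.
        sigma *= sigma_decay
    return mu, sigma
```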

4.2.2 Mixtures of normal distributions 

The probability density function of a normal distribution is centered around its mean and decreases
exponentially with the squared distance from the mean. If there are multiple "clouds" of values, a normal
distribution must either focus on only one of these clouds, or it can embrace multiple clouds at the
expense of including the area between them. In both cases, the resulting distribution cannot model 
the data accurately. One way of extending standard single-peak normal-distribution models to 
enable coverage of multiple groups of similar points is to use a mixture of normal distributions. Each 
component of the mixture of normal distributions is a normal distribution by itself. A coefficient is 
specified for each component of the mixture to denote the probability that a random point belongs 
to this component. The probability density function of a mixture is thus computed by multiplying 
the density function of each mixture component by the probability that a random point belongs to 
the component, and adding these weighted densities together. 
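
As a small illustration of the weighted sum just described, the sketch below evaluates and samples a one-dimensional mixture of normal distributions. Components are represented as hypothetical `(weight, mean, std)` tuples whose weights are assumed to sum to 1.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, components):
    """Weighted sum of the component densities."""
    return sum(w * normal_pdf(x, mean, std) for w, mean, std in components)

def sample_mixture(components, rng):
    """Pick a component according to its weight, then sample from it."""
    weights = [w for w, _, _ in components]
    _, mean, std = rng.choices(components, weights=weights)[0]
    return rng.gauss(mean, std)
```

With two well-separated components, for example `[(0.5, -2.0, 1.0), (0.5, 2.0, 1.0)]`, the density is high around both means and low in between, which a single normal distribution cannot express.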

Gallagher et al. [40, 41] extended EDAs based on single-peak normal distributions by using an
adaptive mixture of normal distributions to model each variable. The parameters of the mixture 
(including the number of components) evolve based on the discovered promising solutions. Using 
mixture distributions is a significant improvement compared to single-peak normal distributions, 
because mixtures allow simultaneous exploration of multiple basins of attraction for each variable. 

Within the IDEA framework, Bosman and Thierens [21] proposed IDEAs using the joint normal
kernels distribution, where a single normal distribution is placed around each selected solution (see
Figure 6(d)). A joint normal kernels distribution can therefore be seen as an extreme use of mixture
distributions with one mixture component per point in the training sample. The variance of each
normal distribution can be fixed to a relatively small value, but it should be preferable to
adapt variances according to the current state of search. Using kernel distributions corresponds to
using a fixed zero-mean normally distributed mutation for each promising solution as is often done
in evolution strategies [143]. That is why it is possible to directly take up strategies for adapting
the variance of each kernel from evolution strategies [143, 144, 172, 57].
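
Sampling a joint normal kernels distribution amounts to picking a selected solution at random and perturbing it with zero-mean normal noise. The sketch below uses one fixed standard deviation `sigma` for all kernels and variables, which is the simplest of the options mentioned above; the function name and signature are hypothetical.

```python
import random

def sample_normal_kernels(selected, sigma, num_samples, rng):
    """Draw points from equally weighted normal kernels centered on `selected`."""
    samples = []
    for _ in range(num_samples):
        center = rng.choice(selected)  # each kernel has the same weight
        samples.append([rng.gauss(c, sigma) for c in center])
    return samples
```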

4.2.3 Joint normal distributions and their mixtures 

What changes when instead of fitting each variable with a separate normal distribution or a mixture 
of normal distributions, groups of variables are considered together? Let us first consider using a 
single-peak normal distribution. In multivariate domains, a joint normal distribution can be defined 
by a vector of $n$ means (one mean per variable) and a covariance matrix of size $n \times n$. Diagonal elements
of the covariance matrix specify the variances for all variables, whereas nondiagonal elements
specify linear dependencies between pairs of variables. Considering each variable separately corresponds
to setting all nondiagonal elements in a covariance matrix to 0. Using different deviations
for different variables allows for "squeezing" or "stretching" the distribution along the axes. On
the other hand, using nondiagonal entries in the covariance matrix allows rotating the distribution
around its mean. Figures 6(b) and 6(c) illustrate the difference between a joint normal distribution
using only diagonal elements of the covariance matrix and a distribution using the full covariance 
matrix. Therefore, using a covariance matrix introduces another degree of freedom and improves 
the expressiveness of a distribution. Again, one can use a number of joint normal distributions in 
a mixture, where each component consists of its mean, covariance matrix, and weight. 

A joint normal distribution including a full or partial covariance matrix was used within the 
IDEA framework [21] and in the estimation of Gaussian networks algorithm (EGNA) [83]. Both 
these algorithms can be seen as extensions of EDAs that model each variable by a single normal 
distribution to use also nondiagonal elements of the covariance matrix. 
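
To make the role of the full covariance matrix concrete, the sketch below fits a joint normal distribution to a set of points by maximum likelihood and samples it through a Cholesky factor. It uses only the standard library, assumes the estimated covariance matrix is positive definite, and is not meant to reproduce the model-building procedures of IDEA or EGNA.

```python
import math
import random

def fit_joint_normal(points):
    """Maximum-likelihood mean vector and full covariance matrix."""
    m, n = len(points), len(points[0])
    mean = [sum(p[i] for p in points) / m for i in range(n)]
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / m
            for j in range(n)] for i in range(n)]
    return mean, cov

def cholesky(a):
    """Lower-triangular L with L * L^T = a (a must be positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def sample_joint_normal(mean, cov, rng):
    """x = mean + L z, where z is a vector of independent standard normals."""
    L = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [mean[i] + sum(L[i][k] * z[k] for k in range(i + 1))
            for i in range(len(mean))]
```

Setting the nondiagonal entries of `cov` to zero recovers the model in which every variable is treated separately.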

Bosman and Thierens [22] proposed mixed IDEAs as an extension of EDAs that use a mixture of 
normal distributions to model each variable. Mixed IDEAs allow multiple variables to be modeled 
by a separate mixture of joint normal distributions. At one extreme, each variable can have a 
separate mixture; at another extreme, one mixture of joint distributions covering all the variables is
used. Although learning such a general class of distributions is quite difficult and a large number
of samples is necessary for reasonable accuracy, good results were reported on single-objective [22]
as well as multiobjective problems [189, 77, 85]. Using mixture models for all variables was also
proposed as a technique for reducing model complexity in discrete EDAs [119].

Real-valued EDAs presented so far are applicable to real-valued optimization problems without
requiring differentiability or continuity of the underlying problem. However, if it is possible to
at least partially differentiate the problem, gradient information can be used to incorporate some
form of gradient-based local search and the performance of real-valued EDAs can be significantly
improved. A study on combining real-valued EDAs within the IDEA framework with gradient-based
local search can be found in ref. [24].

One of the crucial limitations of using estimation of real-valued distributions is that real-valued
EDAs have a tendency to lose diversity too fast even when the problem is relatively easy to solve [19];
for example, maximum likelihood estimation and sampling of a normal distribution will lead to
diversity loss even while climbing a simple linear slope. That is why several EDAs were proposed
that aim to control variance of the probabilistic model so that the loss of variance is avoided and yet
the effective exploration is not hampered by an overly large variance of the model. For example, the
adapted maximum-likelihood Gaussian model iterated density-estimation evolutionary algorithm
(AMaLGaM) scales up the covariance matrix to prevent premature convergence on slopes [17, 18].

4.2.4 Other real-valued EDAs 

Using normal distributions is not the only approach to modeling real-valued distributions. Other
density functions are frequently used to model real-valued probability distributions, including histogram
distributions, interval distributions, and others. A brief review of real-valued EDAs that
use distributions other than normal follows.

In the algorithm proposed by Servet et al. [174], an interval $(a_i, b_i)$ and a number $z_i \in (0, 1)$ are
stored for each variable. By $z_i$, the probability that the $i$th variable is in the upper half of $(a_i, b_i)$
is denoted. Each $z_i$ is initialized to 0.5. To generate a new candidate solution, the value of each
variable is selected randomly from the corresponding interval. The best solution is then used to
update the value of each $z_i$. If the value of the $i$th variable of the best solution is in the lower half
of $(a_i, b_i)$, $z_i$ is shifted toward 0; otherwise, $z_i$ is shifted toward 1. When $z_i$ gets close to 0, interval
$(a_i, b_i)$ is reduced to its lower half; if $z_i$ gets close to 1, interval $(a_i, b_i)$ is reduced to its upper half.
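
A rough sketch of this interval-halving scheme follows. Several details are assumptions made to obtain runnable code and are not taken from the cited work: a value is sampled by first choosing the upper half with probability z_i and then drawing uniformly within that half, z_i is moved with a PBIL-like rule, and z_i is reset to 0.5 after an interval has been halved.

```python
import random

def sample_solution(intervals, z, rng):
    """Sample each variable from the lower or upper half of its interval."""
    x = []
    for (a, b), zi in zip(intervals, z):
        mid = (a + b) / 2.0
        lo, hi = (mid, b) if rng.random() < zi else (a, mid)
        x.append(rng.uniform(lo, hi))
    return x

def update_model(intervals, z, best, rate=0.1, eps=0.1):
    """Shift each z_i toward the half containing the best value; halve if decided."""
    new_intervals, new_z = [], []
    for (a, b), zi, xi in zip(intervals, z, best):
        mid = (a + b) / 2.0
        target = 0.0 if xi < mid else 1.0
        zi += rate * (target - zi)
        if zi < eps:            # confident in the lower half
            b, zi = mid, 0.5
        elif zi > 1.0 - eps:    # confident in the upper half
            a, zi = mid, 0.5
        new_intervals.append((a, b))
        new_z.append(zi)
    return new_intervals, new_z
```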

EDAs proposed in refs. [21, 195] use empirical histograms to model each variable as opposed
to using a single normal distribution or a mixture of normal distributions. In these approaches, a
histogram for each single variable is constructed. New points are then generated according to the
distribution encoded by the histograms for all variables. The sampling of a histogram proceeds
by first selecting a particular bin based on its relative frequency, and then generating a random
point from the interval corresponding to the bin. It is straightforward to replace the histograms in
the above methods by various classification and discretization methods of statistics and machine
learning (such as k-means clustering) [28].
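
The two-step histogram sampling described above can be sketched for a single variable as follows. An equal-width histogram over a known range `[lo, hi]` is assumed, and the function names are hypothetical.

```python
import random

def build_histogram(values, lo, hi, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        k = min(int((v - lo) / width), num_bins - 1)  # v == hi goes to the last bin
        counts[k] += 1
    return counts

def sample_histogram(counts, lo, hi, rng):
    """Select a bin by relative frequency, then a uniform point inside that bin."""
    width = (hi - lo) / len(counts)
    k = rng.choices(range(len(counts)), weights=counts)[0]
    return rng.uniform(lo + k * width, lo + (k + 1) * width)
```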

Pelikan et al. [126, 135] use an adaptive mapping from the continuous domain to the discrete
one in combination with discrete EDAs. The population of promising solutions is first discretized
using equal-width histograms, equal-height histograms, k-means clustering, or other classification
techniques. A population of promising discrete solutions is then selected. New points are created
by applying a discrete recombination operator to the selected population of promising discrete
solutions. For example, new solutions can be generated by building and sampling a Bayesian
network as in BOA. The resulting discrete solutions are then mapped back into the continuous
domain by sampling each class (a bin or a cluster) using the original values of the variables in
the selected population of continuous solutions (before discretization). The resulting solutions are
perturbed using one of the adaptive mutation operators of evolution strategies [143, 144, 172, 57].
In this way, competent discrete EDAs can be combined with advanced methods based on adaptive
local search in the continuous domain. A related approach was proposed by Chen and Chen [30],
who proposed a split-on-demand adaptive discretization method to use in combination with ecGA.

The mixed Bayesian optimization algorithm (mBOA) developed by Ocenasek and Schwarz [110]
models vectors of continuous variables using an extension of Bayesian networks with local structures.
A model used in mBOA consists of a decision tree for each variable. Each internal node in the
decision tree for a variable is a test on the value of another variable. Each test on a variable is
specified by a particular value, which is also included in the node. The test considers two cases: the
value of the variable is greater than or equal to the value in the node or it is smaller. Each internal
node has two children, each child corresponding to one of the two results of the test specified in this
node. Leaves in a decision tree thus correspond to rectangular regions in the search space. For each
leaf, the decision tree for the variable specifies a single-variable mixture of normal distributions
centered around the values of this variable in the solutions consistent with the path to the leaf.
Thus, for each variable, the model in mBOA divides the space reduced to other variables into
rectangular regions, and it uses a single-variable normal kernels distribution to model the variable
in each region. The adaptive variant of mBOA (amBOA) [114] extends mBOA by employing
variance adaptation with the goal of maximizing effectiveness of the search for the optimum on
real-valued problems.

4.3 EDAs for Genetic Programming 



In genetic programming [80], the task is to solve optimization problems with candidate solutions
represented by labeled trees that encode computer programs or symbolic expressions. Internal 
nodes of a tree represent functions or commands; leaves represent functions with no arguments, 
variables, and constants. There are two key challenges that one must deal with when applying 
EDAs to genetic programming. Firstly, the length of programs is expected to vary and it is difficult 
to estimate how large the solution will be without solving the problem first. Secondly, small changes 
in parent-child relationships often lead to large changes in the performance of a candidate solution, 
and often the relationship between nodes in the program trees is more important than their actual 
position. Despite these challenges, even in this problem domain, EDAs have been quite successful. 
In this section we briefly outline some EDAs for genetic programming. 

The probabilistic incremental program evolution (PIPE) algorithm [153, 154] uses a probabilistic
model in the form of a tree of a specified maximum allowable size. Nodes in the model specify 
probabilities of functions and terminals. PIPE does not capture any interactions between the 
nodes in the model. The model is updated by adjusting the probabilities based on the population 
of selected solutions using an update rule similar to the one in PBIL [7]. New program trees 
are generated in a top-down fashion starting in the root and continuing to lower levels of the 
tree. More specifically, if the model generates a function in a node and that function requires 
additional arguments, the successors (children) of the node are generated to form the arguments of 
the function. If a terminal is generated, the generation along this path terminates. An extension 
of PIPE named H-PIPE was later proposed [155]. In H-PIPE, nodes of a model are allowed to
contain subroutines, and both the subroutines as well as the overall program are evolved. 
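
The top-down generation of program trees from a PIPE-style model can be sketched as follows. The instruction set `ARITY`, the class and function names, the lazy creation of model nodes, and the simple PBIL-like update toward a single program are assumptions of this example; the actual PIPE update rule is more involved.

```python
import random

# Hypothetical instruction set: symbol -> number of arguments (0 = terminal).
ARITY = {"+": 2, "*": 2, "x": 0, "1": 0}

class ModelNode:
    """One node of the probabilistic prototype tree."""
    def __init__(self):
        self.probs = {s: 1.0 / len(ARITY) for s in ARITY}
        self.children = []

def sample_program(node, rng, depth=0, max_depth=4):
    """Generate a program as a nested tuple (symbol, arg1, arg2, ...)."""
    # At the maximum depth only terminals may be chosen.
    symbols = [s for s in ARITY if depth < max_depth or ARITY[s] == 0]
    weights = [node.probs[s] for s in symbols]
    symbol = rng.choices(symbols, weights=weights)[0]
    while len(node.children) < ARITY[symbol]:
        node.children.append(ModelNode())  # grow the model lazily
    args = [sample_program(node.children[k], rng, depth + 1, max_depth)
            for k in range(ARITY[symbol])]
    return (symbol, *args)

def update_model(node, program, rate=0.1):
    """Shift node probabilities toward the symbols used by `program`."""
    symbol, *args = program
    for s in node.probs:
        target = 1.0 if s == symbol else 0.0
        node.probs[s] += rate * (target - node.probs[s])
    for child, arg in zip(node.children, args):
        update_model(child, arg, rate)
```

Each node keeps its own distribution and no node depends on any other, which matches the statement that PIPE does not capture interactions between nodes.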

Handley [56] used tree probabilistic models to represent populations of programs (trees) in 
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genetic programming. Although the goal of this work was to compress the population of computer 
programs in genetic programming, Handley's approach can be used within the EDA framework to 
model and sample candidate solutions represented by computer programs or symbolic expressions. 
A similar model was used in estimation of distribution programming (EDP) [199], which extended 
PIPE by employing parent-child dependencies in candidate labeled trees. More specifically, in EDP 
the content of each node is conditioned on the node's parent. 

The extended compact genetic programming (ECGP) [164] assumes a maximum tree of maximum 
branching, like PIPE. Nonetheless, ECGP uses a marginal product model which partitions 
nodes into clusters of strongly correlated nodes. This allows ECGP to capture and exploit in- 
teractions between nodes in program trees, and solve problems that are difficult for conventional 
genetic programming and PIPE. Three main characteristics distinguish ECGP from EDP: ECGP is 
able to capture dependencies between more than two nodes, it learns the dependency structure 
from the promising candidate trees, and it is not restricted to dependencies between parents 
and their children. On the other hand, ECGP is somewhat limited in its ability to encode 
long-range interactions efficiently compared to probabilistic models that do not assume that 
groups of variables must be fully independent of each other. 

Looks et al. [96] proposed to use Bayesian networks to model and sample program trees. Combi- 
natory logic is used to represent program trees in a unified manner. Program trees translated with 
combinatory logic are then modeled with Bayesian networks of BOA, EBNA, and LFDA. Contrary 
to most other EDAs for genetic programming presented in this section, in the approach of Looks et 
al. the size of computer programs is not limited, but solutions are allowed to grow over time. Looks 
later developed a more powerful framework for competent program evolution using EDAs, which 
was named meta-optimizing semantic evolutionary search (MOSES) [94, 93, 95]. The key facets of 
MOSES include the division of the population into demes, the reduction of the problem of evolving 
computer programs to the one of building a representation with tunable features (knobs), and the 
use of hierarchical BOA [120] or another competent evolutionary algorithm to model demes and 
sample new candidate program solutions. 

Several EDAs for genetic programming used probabilistic models based on grammar rules 
[13, 179, 180, 142]. Most grammar-based EDAs for genetic programming use a context-free 
grammar. The stochastic grammar-based genetic programming (SG-GP) [141, 142] started with a fixed 
context-free grammar with a default probability for each rule; the probabilities attached to the 
different rules were gradually adjusted based on the best candidate programs. The program 
evolution with explicit learning (PEEL) [179] used a probabilistic L-system with rules applicable at 
specific depths and locations; the probabilities of the rules were adapted using a variant of ant 
colony optimization (ACO) [35]. Another grammar-based EDA for genetic programming was 
proposed by Bosman and de Jong [13], who used a context-free grammar that is initialized to a 
minimum stochastic context-free grammar and adjusted to better fit promising candidate solutions 
by expanding rules and incorporating depth information into the rules. Grammar model-based 
program evolution (GMPE) [181, 180] also uses a probabilistic context-free grammar. In GMPE, 
new rules are allowed to be created and old rules may be eliminated from the model. A variant 
of the minimum-message-length metric is used in GMPE to compare grammars according to their 
quality. Tanev [186] incorporated stochastic context-sensitive grammars into grammar-guided 
genetic programming [198, 197, 55]. 
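
As a rough illustration of the idea shared by the grammar-based approaches above (a fixed context-free grammar whose rule probabilities are adjusted toward good programs), consider the sketch below. The toy grammar, the depth limit, and the multiplicative update in `reward_rules` are assumptions made for this example; none of the cited systems is claimed to work exactly this way.

```python
import random


def make_grammar():
    """Hypothetical toy grammar: nonterminal -> list of [right-hand side, weight]."""
    return {"E": [[("E", "+", "E"), 1.0], [("E", "*", "E"), 1.0],
                  [("x",), 1.0], [("1",), 1.0]]}


def derive(grammar, symbol, rng, used, depth=0, max_depth=6):
    """Expand `symbol` into a token list, logging each (nonterminal, rule index) used."""
    if symbol not in grammar:
        return [symbol]  # terminal token
    rules = grammar[symbol]
    # At max_depth only rules without nonterminals remain, which forces termination.
    allowed = [i for i, (rhs, _) in enumerate(rules)
               if depth < max_depth or not any(s in grammar for s in rhs)]
    choice = rng.choices(allowed, weights=[rules[i][1] for i in allowed])[0]
    used.append((symbol, choice))
    tokens = []
    for s in rules[choice][0]:
        tokens.extend(derive(grammar, s, rng, used, depth + 1, max_depth))
    return tokens


def reward_rules(grammar, used, factor=1.1):
    """Increase the weight of every rule that a good program used.

    Weights are left unnormalized; sampling normalizes them on the fly.
    """
    for symbol, i in used:
        grammar[symbol][i][1] *= factor
```

A typical loop would derive many programs, evaluate them, and call `reward_rules` with the rule log of the best ones, so that later derivations favor the rules that produced good programs.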
4.4 EDAs for Permutation Problems 

In many problems, candidate solutions are most naturally represented by permutations. This is 
the case for example in many scheduling or facility location problems. These types of problems 
often contain two specific types of features or constraints that EDAs need to capture. The first 
is the absolute position of a symbol in a string and the second is the relative ordering of specific 
symbols. In some problems, such as the traveling-salesman problem, relative ordering constraints 
matter the most. In others, such as the quadratic assignment problem, both the relative ordering 
and the absolute positions matter. 

One approach to permutation problems is to apply an EDA for problems not involving per- 
mutations in combination with a mapping function between the EDA representation and the ad- 
missible permutations. For example, one may use the random key encoding [10] to transfer the 
problem of finding a good permutation into the problem of finding a high-quality real-valued vec- 
tor, allowing the use of EDAs for optimization of real-valued vectors in solving permutation-based 
problems [25, 149]. Random key encoding represents a permutation as a vector of real numbers. 
The permutation is defined by the reordering of the values in the vector that sorts the values in 
ascending order. The main advantage of using random keys is that any real-valued vector defines 
a valid permutation and any EDA capable of solving problems defined on vectors of real 
numbers can thus be used to solve permutation problems. However, since such EDAs do not process 
the aforementioned types of regularities in permutation problems directly, their performance can 
often be poor [25, 23]. That is why several EDAs were developed that aim to encode either type of 
constraints for permutation problems explicitly. 
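
A minimal sketch of random key decoding is given below. The function names and the use of independent Gaussian keys in `sample_permutation` are illustrative assumptions; any real-valued EDA could supply the key vector instead.

```python
import random


def decode_random_keys(keys):
    """Return the permutation of indices that sorts `keys` in ascending order."""
    return sorted(range(len(keys)), key=lambda i: keys[i])


def sample_permutation(means, stds, rng):
    """Draw one key per position from an assumed univariate Gaussian model and decode."""
    keys = [rng.gauss(m, s) for m, s in zip(means, stds)]
    return decode_random_keys(keys)
```

For example, the key vector `[0.7, 0.1, 0.4]` decodes to `[1, 2, 0]`, because position 1 holds the smallest key and position 0 the largest. Every real-valued vector decodes to some valid permutation, which is the main advantage noted above.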

To solve problems where candidate solutions are permutations of a string, Bengoetxea et al. [12] 
start with a Bayesian network model built using the same approach as in EBNA [39]. However, the 
sampling method is changed to ensure that only valid permutations are generated. This approach 
was shown to have promise in solving the inexact graph matching problem. In much the same 
way, the dependency-tree EDA (dtEDA) of Pelikan et al. [136] starts with a dependency-tree 
model [5, 32] and modifies the sampling to ensure that only valid permutations are generated. 
dtEDA for permutation problems was used to solve structured quadratic assignment problems with 
great success [136]. Bayesian networks and tree models are capable of encoding both the absolute 
position and the relative ordering constraints, although for some problem types, such models may 
turn out to be rather inefficient. 

Bosman and Thierens [25] extended the real-valued EDA to the permutation domain by storing 
the dependencies between different positions in a permutation in the induced chromosome element 
exchanger (ICE). ICE works by first using a real-valued EDA, which encodes permutations as 
real-valued vectors using the random keys encoding. ICE extends the real-valued EDA by using 
a specialized crossover operator. By applying the crossover directly to permutations instead of 
simply sampling the model, relative ordering is taken into account. The resulting algorithm was 
shown to outperform many real-valued EDAs that use the random key encoding alone [25]. 

The edge histogram based sampling algorithm (EHBSA) [190, 193] works by creating an edge 
histogram matrix (EHM). For each pair of symbols, EHM stores the probabilities that one of these 
symbols will follow the other one in a permutation. To generate new solutions, EHBSA starts 
with a randomly chosen symbol. EHM is then sampled repeatedly to generate new symbols in the 
solution, normalizing the probabilities based on what values have already been generated. EHM 
does not take into account absolute positions at all; in order to address problems in which absolute 
positions are important, a variation of EHBSA that involved templates was proposed [190]. To 
generate new solutions, first a random string from the population was picked as a template. New 
solutions were then generated by removing random parts of the template string and generating the 
missing parts with sampling from EHM. The resulting algorithm was shown to be better than most 
other EDAs on the traveling salesman problem. In another study, the node histogram based sampling 
algorithm (NHBSA) of Tsutsui et al. [193] considers a model capable of storing node frequencies 
at each position (thereby encoding absolute position constraints) and also uses a template. 
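
The basic EHBSA sampling step (without templates) can be sketched as follows. The directed, cyclic edge counts and the small constant `eps` added to every off-diagonal entry are assumptions made for this illustration, and the function names are hypothetical.

```python
import random


def build_ehm(population, n, eps=0.01):
    """Edge histogram: ehm[a][b] counts how often b directly follows a (cyclically)."""
    ehm = [[0.0 if a == b else eps for b in range(n)] for a in range(n)]
    for perm in population:
        for k in range(n):
            a, b = perm[k], perm[(k + 1) % n]
            ehm[a][b] += 1.0
    return ehm


def sample_ehm(ehm, rng):
    """Start at a random symbol, then repeatedly draw the next unused symbol."""
    n = len(ehm)
    current = rng.randrange(n)
    perm = [current]
    remaining = set(range(n)) - {current}
    while remaining:
        candidates = sorted(remaining)
        # Restricting to unused symbols renormalizes the row implicitly.
        weights = [ehm[current][c] for c in candidates]
        current = rng.choices(candidates, weights=weights)[0]
        perm.append(current)
        remaining.remove(current)
    return perm
```

Because only unused symbols are candidates at each step, every sampled string is a valid permutation. The model stores nothing about absolute positions, which is why the template variant was introduced.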

Zhang [206, 207] proposed to use guided mutation to optimize both permutation problems [152] 
and graph problems [207]. In guided mutation, the parts of the solution that are to be 
modified using a stochastic neighborhood operator are identified by analyzing a probabilistic model 
of the population of promising candidate solutions. 

5 EDA Theory 

Along with the design and application of EDAs, the theoretical understanding of these algorithms 
has improved significantly since the first EDAs were proposed. One way to classify key areas of 
theoretical study of EDAs follows [66]: 



1. Convergence proofs. Some of the most important results in EDA theory focus on the number 
of iterations of an EDA on a particular class of problems or the conditions that allow EDAs to 
provably converge to a global optimum. The convergence time (number of iterations until conver- 
gence) of UMDA on onemax for selection methods with fixed selection intensity was derived by 
Mühlenbein and Schlierkamp-Voosen [101]. The convergence of FDA on separable additively 
decomposable functions (ADFs) was explored by Mühlenbein and Mahnig [104], who developed an 
exact formula for convergence time when using fitness-proportionate selection. Since in practice 
fitness-proportionate selection is rarely used because of its sensitivity to linear transformations 
of the objective function, truncation selection was also examined and an equation was derived 
giving the approximate time to convergence from the analysis of the onemax function. Later, 
Mühlenbein and Mahnig [105] adapted the theoretical model to the class of general ADFs where 
subproblems were allowed to interact. Under the assumption of Boltzmann selection, theory of 
graphical models was used to derive sufficient conditions for an FDA model so that FDA with 
a large enough population is guaranteed to converge to a model that generates only the global 
optima. Zhang [205] analyzed stability of fixed points of limit models of UMDA and FDA, and 
showed that at least for some problems the chance of converging to the global optimum is in- 
deed increased when using higher order models of FDA rather than only the probability vector 
of UMDA. Convergence properties of PBIL were studied for example in refs. [52, 73, 82]. 

2. Population sizing. The convergence proofs mentioned above assumed infinite populations in 
order to simplify calculations. However, in practice using an infinite population is not possible 
and the choice of an adequate population size is crucial, as it is for other population-based 
evolutionary algorithms [45, 46, 61, 58]. Using a population that is too small can lead to 
convergence to solutions of low quality and inability to reliably find the global optimum. On the other 
hand, using a population that is too large can lead to an increased complexity of building and 
sampling probabilistic models, evaluating populations, and executing other EDA components. 
Similar to genetic algorithms, EDAs must have a population size sufficiently large to provide 
an adequate initial supply of partial solutions in an adequate problem decomposition [46, 131] 
and to ensure that good decisions are made between competing partial solutions [58]. However, 
the population must also be large enough for EDAs to make good decisions about presence 
or absence of statistically significant variable interactions. To examine this topic, Pelikan et 
al. [131] analyzed the population size required for BOA to solve decomposable problems of 
bounded difficulty with uniformly and nonuniformly scaled subproblems. The results showed 
that the population sizes required grew nearly linearly with the number of subproblems (or 
problem size). The results also showed that the approximate number of evaluations grew 
subquadratically for uniformly scaled subproblems but was quadratic on some nonuniformly scaled 
subproblems. Yu et al. [203] refined the model of Pelikan et al. [131] to provide a more accurate 
bound for the adequate population size in multivariate entropy-based EDAs such as ECGA and 
BOA, and also examined the effects of the selection pressure on the population size. Population 
sizing was also empirically analyzed in FDA by Mühlenbein [103]. 

3. Diversity loss. Stochastic errors in sampling can lead to a loss of diversity that may sometimes 
hamper EDA performance. Shapiro [182] examined the susceptibility of UMDA to diversity loss 
and discussed how it is necessary to set the learning parameters in such a way that this does not 
happen. Bosman et al. [14] examined diversity loss in EDAs for solving real-valued problems 
and the approaches to alleviating this difficulty. The results showed that due to diversity loss 
some of the state-of-the-art EDAs for real-valued problems could still fail on slope-like regions in 
the search space. The authors proposed using anticipated mean shift (AMS) to shift the mean 
of new solutions each generation in order to effectively maintain diversity. 

4. Memory complexity. Another factor of importance in EDA problem solving is the memory 
required to solve the problem. Gao and Culberson [42] examined the space complexity of 
FDA and BOA on additively decomposable functions where overlap was allowed between 
subfunctions, and proved that the space complexity of FDA and BOA is exponential in the 
problem size even with very sparse interaction between variables. While these results are 
somewhat negative, the authors point out that this only shows that EDAs have limitations 
and work best when the interaction structure is of bounded size. Note that one way to 
reduce the memory complexity of EDAs is to use incremental EDAs, such as PBIL [7], cGA [62], 
or iBOA [133]. 

5. Model accuracy. Model accuracy studies examine the accuracy of models in EDAs. Hauschild 
et al. [68] analyzed the models generated by hBOA when solving concatenated traps, random 
additively decomposable problems, hierarchical traps and two-dimensional Ising spin glasses. 
The models generated were then compared to the underlying problem structure by analyzing 
the number of spurious and correct dependencies. The results showed that the models corre- 
sponded closely to the structure of the underlying problems and that the models did not change 
significantly between consecutive iterations of hBOA. The relationship between the probabilistic 
models learned by BOA and the underlying problem structure was also explored by Lima et 
al. [89]. One of the most important contributions of this study was to demonstrate the dramatic 
effect that selection has on spurious dependencies. The results showed that model accuracy 
was significantly improved when using truncation selection compared to tournament selection. 
Motivated by these results, the authors modified the complexity penalty of BOA model building 
to take into account tournament sizes when using binary tournament selection. Echegoyen et 
al. [37] also analyzed the structural accuracy of the models using EBNA on concatenated traps, 
two variants of Ising spin glass and MAXSAT. In this work two variations of EBNA were com- 
pared, one that was given the complete model structure based on the underlying problem and 
another that learned the approximate structure. The authors then examined the probability at 
any generation that the models would generate the optimal solution. The results showed that it 
was not strictly necessary to have all the interactions that were in the complete model in order to 
solve the problems. It was also discovered that in order for the algorithm to reach a solution, the 
probability of an optimal solution must always exceed a certain threshold. Finally, the effects 
of spurious linkages on EDA performance were examined by Radetic and Pelikan [139]. The 
authors started by proposing a theoretical model to describe the effects of spurious (unnecessary) 
dependencies on the population sizing of EDAs. This model was then tested empirically on 
onemax, and the results showed that although spurious dependencies would be expected to 
have little effect on population size, the effects were substantial when niching was included. 
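
Several of the quantities discussed in the list above (convergence time, population size, and diversity loss) can be observed in a small experiment. The sketch below runs UMDA with truncation selection on onemax; the function name, the stopping rule, and all parameter defaults are assumptions chosen for illustration rather than settings taken from the cited studies.

```python
import random


def umda_onemax(n, pop_size, rng, max_gens=200, truncation=0.5):
    """Run UMDA on onemax; return (generations used, final probability vector)."""
    p = [0.5] * n  # uniform initial model
    for gen in range(1, max_gens + 1):
        pop = [[1 if rng.random() < pi else 0 for pi in p] for _ in range(pop_size)]
        pop.sort(key=sum, reverse=True)  # onemax fitness is the number of ones
        selected = pop[: max(1, int(truncation * pop_size))]
        p = [sum(ind[i] for ind in selected) / len(selected) for i in range(n)]
        # The model has converged once every marginal is fixed at 0 or 1.
        if all(pi in (0.0, 1.0) for pi in p):
            return gen, p
    return max_gens, p
```

Running this with a very small `pop_size` tends to fix some marginals at 0 through sampling noise alone, after which the optimum can no longer be generated. Larger populations make such losses increasingly unlikely at the cost of more evaluations per generation.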

6 Efficiency enhancement techniques for EDAs 

EDAs can solve many classes of important problems in a robust and scalable manner, oftentimes 
requiring only a low-order polynomial growth of the number of function evaluations with respect 
to the number of decision variables [50, 84, 97, 107, 131, 117, 130]. However, even a low-order 
polynomial complexity is sometimes insufficient for practical application of EDAs, especially when 
the number of decision variables is extremely large, when evaluation of candidate solutions is 
computationally expensive, or when there are many conflicting objectives to optimize. The good 
news is that a number of approaches exist that can be used to further enhance efficiency of EDAs. 
Some of these techniques can be adopted from genetic and evolutionary algorithms with little or no 
change. However, some techniques are directly targeted at EDAs because these techniques exploit 
some of the unique advantages of EDAs over most other metaheuristics. Specifically, some efficiency 
enhancements capitalize on the facts that the use of probabilistic models in EDAs provides a rigorous 
and flexible framework for incorporating prior knowledge about the problem into optimization, and 
that EDAs provide practitioners with a series of probabilistic models that reveal a lot of information 
about the problem. This section reviews some of the most important efficiency enhancement 
techniques for EDAs with main focus on techniques designed specifically for EDAs. 

6.1 Parallelization 

One of the most straightforward approaches to speeding up any algorithm is to distribute the 
computation over a number of computational nodes so that several computational tasks can be 
executed in parallel. There are two main bottlenecks of EDAs that are typically addressed by 
parallelization: (1) fitness evaluation, and (2) model building and sampling. If fitness evaluation 
is computationally expensive, a master-slave architecture can be used for distributing fitness eval- 
uations and collecting the results [27] . If most computational time is spent in model building and 
sampling, model building and sampling should be parallelized [111, 115, 112]. 
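
A minimal sketch of distributing fitness evaluations is shown below. It uses a thread pool from the Python standard library purely for illustration; for CPU-bound fitness functions a process pool or a cluster scheduler would be the more usual choice, and the function name and `workers` parameter are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_population(population, fitness, workers=4):
    """Master hands out candidate solutions; workers return their fitness values.

    Results come back in the same order as `population`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fitness, population))
```

For example, `evaluate_population([[1, 0, 1], [1, 1, 1]], sum)` returns `[2, 3]`. Model building and sampling remain sequential in this sketch, so it helps only when evaluation dominates the running time.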

Many parallelization techniques and much of the theory can be adopted from research on par- 
allelization in genetic and evolutionary algorithms [27]. In the context of EDAs, parallelization of 
model building was discussed for example by Očenášek et al. [111, 113, 115, 112], who proposed the 
parallel BOA, and by Larrañaga et al. [84], who parallelized model building in EBNA. One of the 
most impressive results in parallelization of EDAs was published by Sastry et al. [166, 51], who 
proposed a highly efficient, fully parallelized implementation of cGA to solve large-scale problems with 
millions to billions of variables even with a substantial amount of external noise in the objective 
function. 

6.2 Hybridization 

An optimization hybrid combines two or more optimizers in a single procedure [72, 183, 54]. 
Typically, a global procedure and a local procedure are combined; the global procedure is expected 
to find promising regions and the local procedure is expected to find local optima quickly within 
reasonable basins of attraction. Global and local search are used in concert to find good solutions 
faster and more reliably than would be possible using either procedure alone. 

Numerous studies have proposed to combine EDAs with variants of local search both in the 
discrete domain [117, 121, 140] and in the real-valued domain [16]. The main reason for combining 
EDAs with local search is that by reducing the search space to the local optima, the structure of the 
problem can be identified more easily and the population-sizing requirements can be significantly 
decreased [117, 121]. Furthermore, the search reduces to the space of basins of attraction around 
each local optimum as opposed to the space of all admissible solutions. 

However, hybridization of EDAs is not restricted to the combination of an EDA with simple local 
search. As was already pointed out, probabilistic models often contain a lot of information about 
the problem. By mining these models for information about the structure and other properties of 
the problem landscape, decisions can be made about the nature and likely effectiveness of particular 
local search procedures and appropriate neighborhood structures for those procedures 
[91, 90, 100, 116, 159]. In turn, subsequent local search as well as the coordination of the global and 
local search in a hybrid can be managed so that excellent solutions are found quickly, reliably and accurately. 

There are two main approaches to the design of EDA-based (model-directed) hybrids with 
advanced neighborhoods: (1) belief propagation, which uses the probabilistic model to generate 
the most likely instance [90, 100, 116], and (2) local search with an advanced neighborhood 
structure derived from an EDA model [91, 159]. However, it is important to note that the use of 
EDA models is not limited to advanced neighborhood structures or belief propagation, and one 
may envision the use of probabilistic models to control the division of time resources between the 
global and local searcher and in a number of other tasks. 

Local search based on advanced neighborhood structures in a hill-climbing like procedure [75, 
137] is strongly related to model-directed hybridization using EDAs, although in this approach no 
estimation of distributions takes place. The basic idea is to use a linkage learning approach to 
detect important interactions between problem variables, and then run a local search based on a 
neighborhood defined by the underlying problem decomposition. 

6.3 Time Continuation 

To achieve the same solution quality, one may run an EDA or another population-based metaheuris- 
tic with a large population for one convergence epoch, or run the algorithm with a small population 
for a large number of convergence epochs with controlled restarts between these epochs [49]. Similar 
tradeoffs are involved in the design of efficient and reliable hybrid procedures where an appropriate 
division of computational resources between the component algorithms is critical. The term time 
continuation is used to refer to the tradeoffs involved [47]. 

Two important studies related to time continuation in EDAs were published by Sastry et 
al. [160, 161]. Based on a theoretical model of an ECGA-based hybrid, Sastry et al. showed 
that under certain assumptions, the neighborhoods created from EDA-built models provide suffi- 
cient information for local search to succeed on its own even on classes of problems for which local 
search with standard neighborhoods performs poorly. However, in many other cases, EDA-driven 
search in a hybrid with local search based on the adaptive neighborhood should perform better, 
especially if the structure of the problem is complex and the problem is affected by external noise. 

One of the promising research directions related to time continuation in EDAs is to mine 
probabilistic models discovered by EDAs to find an optimal way to exploit time continuation 
tradeoffs, be it in an EDA alone or in an EDA-based hybrid. 
6.4 Using Prior Knowledge and Learning from Experience 

The use of prior knowledge has long been studied and practiced in optimization. For example, 
promising partial solutions may be used to bias the initial population of candidate solutions, spe- 
cialized search operators can be designed to solve a particular class of problems, or representations 
can be biased in order to make the search for the optimum an easier task. However, one of the 
limitations of most of these approaches is that the prior knowledge must be incorporated by hand 
and the approaches are limited to one specific problem domain. 

The use of probabilistic models provides EDAs with a unique framework for incorporating prior 
knowledge into optimization because of the possibility of using Bayesian statistics to combine prior 
knowledge with data in the learning of probabilistic models [6, 64, 171]. Furthermore, the use of 
probabilistic models in EDAs provides a basis for learning from previous runs in order to solve new 
problem instances of similar type with increased speed, accuracy and reliability [64, 67, 117]. For 
example, Hauschild and Pelikan [65, 67] proposed to use a probability coincidence matrix to store 
probabilities of Bayesian-network dependencies between different pairs of problem variables in prior 
hBOA runs and to bias the model building in hBOA on future problem instances of similar type 
using the matrix. 
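
The general idea of combining prior knowledge with data through Bayesian statistics can be illustrated on the simplest possible model, a vector of univariate marginals. The sketch below is a univariate analogue introduced only for illustration; it is not the structural prior over Bayesian-network dependencies described above, and the names `prior_p` and `strength` are assumptions.

```python
def biased_marginals(selected, prior_p, strength):
    """Posterior-mean estimate of P(x_i = 1) from data plus prior pseudo-counts.

    `prior_p[i]` is the prior belief that bit i equals 1 (for example, its
    frequency in good solutions from earlier runs); `strength` is the number
    of pseudo-observations the prior is worth.
    """
    n_sel = len(selected)
    return [
        (sum(ind[i] for ind in selected) + strength * prior_p[i]) / (n_sel + strength)
        for i in range(len(prior_p))
    ]
```

With `selected = [[1, 0], [1, 0]]`, `prior_p = [0.5, 0.5]` and `strength = 2.0`, the estimate is `[0.75, 0.25]` rather than the maximum-likelihood `[1.0, 0.0]`. A larger `strength` pulls the model more firmly toward what was learned before.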

6.5 Fitness Evaluation Relaxation 

To reduce the number of objective (fitness) function evaluations, a model of the fitness function can 
be built [129, 167, 168]. If an advanced EDA is used that contains a complex probabilistic model, 
the model itself can be mined to provide a set of statistics that can be estimated for an accurate, 
efficient computational model of the objective function. The model is then used to replace some 
of the evaluations, possibly most of them. It was shown that the use of adequate models of the 
objective function can yield multiplicative speedups of several tens [129, 167, 168]. 

6.6 Incremental and Sporadic Model Building 

With sporadic model-building, the structure of the probabilistic model is built once every few 
generations and the probabilities are updated every generation [134]. With incremental model 
building, the model is built incrementally starting from the structure discovered in the previous 
iteration [39]. This allows for models that are ideally both more accurate and quicker to learn. 
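
The control flow of sporadic model building can be sketched generically as follows. The four callables (`select`, `learn_structure`, `fit_parameters`, `sample`) are hypothetical placeholders for the components of whatever EDA is being used, and the fixed rebuilding period is an assumption of this sketch.

```python
def run_sporadic(generations, period, select, learn_structure, fit_parameters, sample):
    """Relearn the model structure every `period` generations, parameters every generation."""
    structure, population = None, None
    for gen in range(generations):
        selected = select(population)
        if gen % period == 0:
            structure = learn_structure(selected)    # expensive step, done sporadically
        model = fit_parameters(structure, selected)  # cheap step, done every generation
        population = sample(model)
    return population
```

With `generations=10` and `period=4`, the structure is learned in generations 0, 4 and 8 only, while the parameters are refitted ten times.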

7 Starting Points for Obtaining Additional Information 

This section provides pointers for obtaining additional information about EDAs. 

7.1 Introductory Books and Tutorials 

Numerous books and other publications exist that provide introduction to estimation of distribution 
algorithms and additional starting points. The following list of references includes some of them: [53]. 



7.2 Software 

The following list includes some of the popular EDA implementations available online. These 
implementations should provide a good starting point for the interested reader. Entries in the list 
are ordered alphabetically. Note that the list is not exhaustive. 

• Adapted maximum-likelihood Gaussian model iterated density estimation evolutionary 
  algorithm (AMaLGaM) [18]: 
  http://homepages.cwi.nl/~bosman/source_code.php 



• Bayesian optimization algorithm (BOA) [123]; BOA with decision graphs [125]; 
  dependency-tree EDA: 
  http://medal.cs.umsl.edu/ 

• Demos of aggregation pheromone system (APS) [191] and histogram-based EDAs for 
  permutation-based problems (EHBSA) [193]: 
  http://www.hannan-u.ac.jp/~tsutsui/research-e.html 

• Distribution estimation using Markov random fields (DEUM) [177, 176]: 
  http://sidshakya.com/Downloads/Main.html 

• Extended compact genetic algorithm [60], χ-ary ECGA, BOA [123], BOA with decision 
  trees/graphs [125], and others: 
  http://illigal.org/ 

• Mixed BOA (mBOA) [110], adaptive mBOA (amBOA) [114]: 
  http://jiri.ocenasek.com/ 

• Probabilistic incremental program evolution (PIPE) [154]: 
  ftp://ftp.idsia.ch/pub/rafal/ 

• Real-coded BOA (rBOA) [2], multiobjective rBOA: 
  http://www.evolution.re.kr/ 

• Regularity model based multiobjective EDA (RM-MEDA) [208]; hybrid of differential 
  evolution and EDA [87]; model-based multiobjective evolutionary algorithm (MMEA) [206], and 
  others: 
  http://cswww.essex.ac.uk/staff/qzhang/mypublication.htm 



7.3 Journals 

The following journals are key venues for papers on EDAs and evolutionary computation, although 
papers on EDAs can be found in many other journals focusing on optimization, artificial intelligence, 
machine learning, and applications. 

• Evolutionary Computation (MIT Press): 
  http://www.mitpressjournals.org/loi/evco 

• Evolutionary Intelligence (Springer): 
  http://www.springer.com/engineering/journal/12065 

• Genetic Programming and Evolvable Machines (Springer): 
  http://www.springer.com/computer/ai/journal/10710 

• IEEE Transactions on Evolutionary Computation (IEEE Press): 
  http://ieeexplore.ieee.org/servlet/opac?punumber=4235 

• Natural Computing (Springer): 
  http://www.springer.com/computer/theoretical+computer+science/journal/11047 

• Swarm and Evolutionary Computation (Elsevier): 
  http://www.journals.elsevier.com/swarm-and-evolutionary-computation/ 



7.4 Conferences 

The following conferences provide the most important venues for publishing papers on EDAs and 
evolutionary computation, although, as with journals, papers on EDAs are often published 
in other venues. 

• ACM SIGEVO Genetic and Evolutionary Computation Conference (GECCO) 

• European Workshops on Applications of Evolutionary Computation (EvoWorkshops) 

• IEEE Congress on Evolutionary Computation (CEC) 

• Main European Events on Evolutionary Computation (EvoStar) 

• Parallel Problem Solving in Nature (PPSN) 

• Simulated Evolution and Learning (SEAL) 

8 Summary and Conclusions 

EDAs are a class of stochastic optimization algorithms that have been gaining popularity due to 
their ability to solve a broad array of complex problems with excellent performance and scalability. 
Moreover, while many of these algorithms have been shown to perform well with little or no problem- 
specific information, such information can be used advantageously if available. 

EDAs have their roots in the fields of evolutionary computation and machine learning. From 
evolutionary computation EDAs borrow the idea of using a population of solutions that evolves 
through iterations of selection and variation. From machine learning EDAs borrow the idea of 
learning models from data, and they use the resulting models to guide the search for better solutions. 
This approach is powerful especially because it allows the search algorithm to adapt to the problem 
being solved, giving EDAs the possibility of being an effective black-box search algorithm. Since 
most real world problems have some sort of inherent structure (as opposed to being completely 
random), there is a hope that EDAs can learn such a structure, or at least parts of it, and put that 
knowledge to good use in searching for optima. 

Another key characteristic of EDAs, and one that sets them apart from other metaheuristics, lies 
in the fact that the sequence of probabilistic models learned along a particular run (or a sequence 
of runs) yields important information that can be exploited for other purposes. For example, such 
information can be used for building surrogate models of the objective function leading to significant 
performance speedups, for designing effective neighborhoods for local search when conventional 
neighborhoods fail, and even for learning about characteristics of an entire class of problems that 
can in turn be used to solve other instances of the same problem class. 

This chapter gave an introduction and reviewed both the history and the state of the art in EDA 
research. The basic concepts of these algorithms were presented and a taxonomy was outlined from 
the views based on the model decomposition and the type of local distributions. The most popular 
EDAs proposed in the literature were then surveyed according to the most common representations 
for candidate solutions. Finally, the major theoretical research areas and efficiency enhancement 
techniques for EDAs were highlighted. This chapter should be valuable both for those who want 
to grasp the basic ideas of EDAs as well as for those who want to have a coherent view of EDA 
research. 